The Split-\(\widehat{R}\) statistic and the effective sample size (abbreviated as ESS or \(S_{\rm eff}\); previously called \(N_{\rm eff}\) or \(n_{\rm eff}\)) are routinely used to monitor the convergence of iterative simulations, which are omnipresent in Bayesian statistics in the form of Markov-Chain Monte-Carlo samples. The original \(\widehat{R}\) statistic (Gelman and Rubin, 1992; Brooks and Gelman, 1998) and split-\(\widehat{R}\) (Gelman et al., 2013) are both based on the ratio of between and within-chain marginal variances of the simulations, while the latter is computed from split chains (hence the name).
\(\widehat{R}\), split-\(\widehat{R}\), and \(S_{\rm eff}\) are well defined only if the marginal distributions have finite mean and variance. Even if that’s the case, their estimates are less stable for distributions with long tails. To alleviate these problems, we define split-\(\widehat{R}\) and \(S_{\rm eff}\) using rank normalized values, empirical cumulative density functions, and small posterior intervals.
The code for the new split-\(\widehat{R}\) and \(S_{\rm eff}\) versions and for the corresponding diagnostic plots can be found in monitornew.R and monitorplot.R, respectively.
In this section, we will review the split-\(\widehat{R}\) and effective sample size estimates as implemented in Stan 2.18 (Stan Development Team, 2018e). These implementations represent the current de facto standard of convergence diagnostics for iterative simulations.
Below, we present the computation of Split-\(\widehat{R}\) following Gelman et al. (2013), but using the notation style of Stan Development Team (2018a). Our general recommendation is to always run several chains. \(N\) is the number of draws per chain, \(M\) is the number of chains, and \(S=MN\) is the total number of draws from all chains. For each scalar summary of interest \(\theta\), we compute \(B\) and \(W\), the between- and within-chain variances:
\[ B = \frac{N}{M-1}\sum_{m=1}^{M}(\overline{\theta}^{(.m)}-\overline{\theta}^{(..)})^2, \;\mbox{ where }\;\;\overline{\theta}^{(.m)}=\frac{1}{N}\sum_{n=1}^N \theta^{(nm)},\;\; \;\;\overline{\theta}^{(..)} = \frac{1}{M}\sum_{m=1}^M\overline{\theta}^{(.m)} \\ W = \frac{1}{M}\sum_{m=1}^{M}s_j^2, \;\mbox{ where }\;\; s_m^2=\frac{1}{N-1} \sum_{n=1}^N (\theta^{(nm)}-\overline{\theta}^{(.m)})^2. \]
The between-chain variance, \(B\), also contains the factor \(N\) because it is based on the variance of the within-chain means, \(\overline{\theta}^{(.m)}\), each of which is an average of \(N\) values \(\theta^{(nm)}\).
We can estimate \(\mbox{var}(\theta \mid y)\), the marginal posterior variance of the estimand, by a weighted average of \(W\) and \(B\), namely \[ \widehat{\mbox{var}}^+(\theta \mid y)=\frac{N-1}{N}W + \frac{1}{N}B. \] This quantity overestimates the marginal posterior variance assuming the starting distribution of the simulations is appropriately overdispersed compared to the target distribution, but is unbiased under stationarity (that is, if the starting distribution equals the target distribution), or in the limit \(N\rightarrow\infty\). To have an overdispersed starting distribution, independent Markov chains should be initialized with diffuse starting values for the parameters.
Meanwhile, for any finite \(N\), the within-chain variance \(W\) should underestimate \(\mbox{var}(\theta \mid y)\) because the individual chains haven’t had the time to explore all of the target distribution and, as a result, will have less variability. In the limit as \(N\rightarrow\infty\), the expectation of \(W\) also approaches \(\mbox{var}(\theta \mid y)\).
We monitor convergence of the iterative simulations to the target distribution by estimating the factor by which the scale of the current distribution for \(\theta\) might be reduced if the simulations were continued in the limit \(N\rightarrow\infty\). This potential scale reduction is estimated as \[ \widehat{R}= \sqrt{\frac{\widehat{\mbox{var}}^+(\theta \mid y)}{W}}, \] which declines to 1 as \(N\rightarrow\infty\). We call this split-\(\widehat{R}\) because we are applying it to chains that have been split in half so that \(M\) is twice the number of actual chains. Without splitting, \(\widehat{R}\) would get fooled by non-stationary chains (see Appendix D).
We note that split-\(\widehat{R}\) is also well defined for sequences that are not Markov-chains. However, for simplicity, we always refer to ‘chains’ instead of more generally to ‘sequences’ as the former is our primary use case for \(\widehat{R}\)-like measures.
If the \(N\) simulation draws within each chain were truly independent, the between-chain variance \(B\) would be an unbiased estimate of the posterior variance, \(\mbox{var}(\theta \mid y)\), and we would have a total of \(S = MN\) independent simulations from the \(M\) chains. In general, however, the simulations of \(\theta\) within each chain will be autocorrelated, and thus \(B\) will be larger than \(\mbox{var}(\theta \mid y)\), in expectation.
A nice introductory reference for analyzing MCMC results in general and effective sample size in particular is provided by Geyer (2011, see also 1992). The particular calculations used by Stan (Stan Development Team, 2018e) follow those for split-\(\widehat{R}\), which involve both between-chain (mean) and within-chain calculations (autocorrelation). They were introduced in the Stan manual (Stan Development Team, 2018d) and explained in more detail in Gelman et al. (2013).
One way to define effective sample size for correlated simulation draws is to consider the statistical efficiency of the average of the simulations \(\bar{\theta}^{(..)}\) as an estimate of the posterior mean \(\mbox{E}(\theta \mid y)\). This generalizes also to posterior expectations of functionals of parameters \(\mbox{E}(g(\theta) \mid y)\) and we return later to how to estimate the effective sample size of quantiles which cannot be presented as expectations. For simplification, in this section we consider the effective sample size for the posterior mean.
The effective sample size of a chain is defined in terms of the autocorrelations within the chain at different lags. The autocorrelation \(\rho_t\) at lag \(t \geq 0\) for a chain with joint probability function \(p(\theta)\) with mean \(\mu\) and variance \(\sigma^2\) is defined to be \[ \rho_t = \frac{1}{\sigma^2} \, \int_{\Theta} (\theta^{(n)} - \mu) (\theta^{(n+t)} - \mu) \, p(\theta) \, d\theta. \] This is just the correlation between the two chains offset by \(t\) positions. Because we know \(\theta^{(n)}\) and \(\theta^{(n+t)}\) have the same marginal distribution in an MCMC setting, multiplying the two difference terms and reducing yields \[ \rho_t = \frac{1}{\sigma^2} \, \int_{\Theta} \theta^{(n)} \, \theta^{(n+t)} \, p(\theta) \, d\theta. \]
The effective sample size of one chain generated by a process with autocorrelations \(\rho_t\) is defined by \[ N_{\rm eff} \ = \ \frac{N}{\sum_{t = -\infty}^{\infty} \rho_t} \ = \ \frac{N}{1 + 2 \sum_{t = 1}^{\infty} \rho_t}. \]
Effective sample size \(N_{\rm eff}\) can be larger than \(N\) in case of antithetic Markov chains, which have negative autocorrelations on odd lags. The Dynamic Hamiltonian Monte-Carlo algorithms used in Stan (Hoffman and Gelman, 2014; Betancourt, 2017) can produce \(N_{\rm eff}>N\) for parameters with a close to Gaussian posterior (in the unconstrained space) and low dependency on other parameters.
In practice, the probability function in question cannot be tractably integrated and thus neither autocorrelation nor the effective sample size can be calculated. Instead, these quantities must be estimated from the samples themselves. The rest of this section describes an autocorrelation and split-\(\widehat{R}\) based effective sample size estimator, based on multiple split chains. For simplicity, each chain will be assumed to be of the same length \(N\).
Stan carries out the autocorrelation computations for all lags simultaneously using Eigen’s fast Fourier transform (FFT) package with appropriate padding; see Geyer (2011) for more details on using FFT for autocorrelation calculations. The autocorrelation estimates \(\hat{\rho}_{t,m}\) at lag \(t\) from multiple chains \(m \in (1,\ldots,M)\) are combined with the within-chain variance estimate \(W\) and the multi-chain variance estimate \(\widehat{\mbox{var}}^{+}\) introduced in the previous section to compute the combined autocorrelation at lag \(t\) as \[ \hat{\rho}_t = 1 - \frac{\displaystyle W - \textstyle \frac{1}{M}\sum_{m=1}^M \hat{\rho}_{t,j}}{\widehat{\mbox{var}}^{+}}. \label{rhohat} \] If the chains have not converged, the variance estimator \(\widehat{\mbox{var}}^{+}\) will overestimate the true marginal variance which leads to an overestimation of the autocorrelation and an underestimation of the effective sample size.
Because of noise in the correlation estimates \(\hat{\rho}_t\) increases as \(t\) increases, typically the truncated sum of \(\hat{\rho}_t\) is used. Negative autocorrelations can happen only on odd lags and by summing over pairs starting from lag \(t=0\), the paired autocorrelation is guaranteed to be positive, monotone and convex modulo estimator noise (Geyer, 1992, 2011). Stan 2.18 uses Geyer’s initial monotone sequence criterion. The effective sample size of combined chains is defined as \[ S_{\rm eff} = \frac{N \, M}{\hat{\tau}}, \] where \[ \hat{\tau} = 1 + 2 \sum_{t=1}^{2k+1} \hat{\rho}_t = -1 + 2 \sum_{t'=0}^{k} \hat{P}_{t'}, \] and \(\hat{P}_{t'}=\hat{\rho}_{2t'}+\hat{\rho}_{2t'+1}\). The initial positive sequence estimator is obtained by choosing the largest \(k\) such that \(\hat{P}_{t'}>0\) for all \(t' = 1,\ldots,k\). The initial monotone sequence estimator is obtained by further reducing \(\hat{P}_{t'}\) to the minimum of the preceding values so that the estimated sequence becomes monotone.
The effective sample size \(S_{\rm eff}\) described here is different from similar formulas in the literature in that we use multiple chains and between-chain variance in the computation, which typically gives us more conservative claims (lower values of \(S_{\rm eff}\)) compared to single chain estimates, especially when mixing of the chains is poor. If the chains are not mixing at all (e.g., the posterior is multimodal and the chains are stuck in different modes), then our \(S_{\rm eff}\) is close to the number of chains.
Before version 2.18, Stan used a slightly incorrect initial sequence which implied that \(S_{\rm eff}\) was capped at \(S\) and thus the effective sample size was underestimated for some models. As antithetic behavior (i.e., \(S_{\rm eff} > S\)) is not that common or the effect is small, and capping at \(S\) can be considered to be pessimistic, this had negligible effect on any inference. However, it may have led to an underestimation of Stan’s efficiency when compared to other packages performing MCMC sampling.
As split-\(\widehat{R}\), and \(S_{\rm eff}\) are well defined only if the marginal posteriors have finite mean and variance, we next introduce split-\(\widehat{R}\) and \(S_{\rm eff}\) using rank normalized values, empirical cumulative density functions, and small posterior intervals which are well defined for all distributions and more robust for long tailed distributions.
Rank normalized split-\(\widehat{R}\) is computed using the equations in Section Split-\(\widehat{R}\) by replacing the original parameter values \(\theta^{(nm)}\) with their corresponding rank normalized values \(z^{(nm)}\).
Rank normalization: 1. Rank: Replace each value \(\theta^{(nm)}\) by its rank \(r^{(nm)}\). Average rank for ties are used to conserve the number of unique values of discrete quantities. Ranks are computed jointly for all draws from all chains. 2. Normalize: Normalize ranks by inverse normal transformation \(z^{(nm)} = \phi^{-1}((r^{(nm)}-1/2)/S)\).
Appendix B illustrates the rank normalization of multiple chains.
For continuous variables and \(S \rightarrow \infty\), the rank normalized values are normally distributed and rank normalization performs non-parametric transformation to normal distribution. Using normalized ranks instead of ranks directly, has the benefit that the behavior of \(\widehat{R}\) and \(S_{\rm eff}\) do not change for normally distributed \(\theta\).
Both original and rank-normalized split-\(\widehat{R}\) can be fooled if chains have different scales but the same location (see Appendix D). To alleviate this problem, we propose to compute rank normalized folded-split-\(\widehat{R}\) using folded split chains by rank normalizing absolute deviations from median \[ {\rm abs}(\theta^{(nm)}-{\rm median}(\theta)). \]
To obtain a single conservative \(\widehat{R}\) estimate, we propose to report the maximum of rank normalized split-\(\widehat{R}\) and rank normalized folded-split-\(\widehat{R}\) for each parameter.
In addition to using rank-normalized values for convergence diagnostics via \(\widehat{R}\), we can also compute the corresponding effective sample size \(S_{\rm eff}\), using equations in Section Effective sample size by replacing parameter values \(\theta^{(nm)}\) with rank normalized values \(z^{(nm)}\). This estimate will be well defined even if the original distribution does not have finite mean and variance. It is not directly applicable to estimate the Monte Carlo error of the mean of the original values, but it will provide a bijective transformation-invariant estimate of the mixing efficiency of chains. For parameters with a close to normal distribution, the difference to using the original values is small. However, for parameters with a distribution far from normal, rank normalization can be seen as a near optimal non-parametric transformation.
The effective sample size using rank normalized values is a useful measure for efficiency of estimating the bulk (mean and quantiles near the median) of the distribution, and as shorthand term we use term bulk effective sample size (bulk-ESS). Bulk-ESS is also useful for diagnosing problems due to trends or different means of the chains (see Appendix D).
We propose to compute the effective sample size also using folded split chains by rank normalizing absolute deviations from median (see above), which is a useful measure for the efficiency of estimating the distributions’ tail. As a shorthand, we use the term tail effective sample size (tail-ESS). Tail-ESS is also useful for diagnosing problems due to different scales of the chains (see Appendix D).
The bulk- and tail-ESS introduced above are useful as overall efficiency measures. Next, we introduce effective sample size estimates for the cumulative distribution function (CDF), and later we use that to introduce efficiency diagnostics for quantiles and local small probability intervals.
Quantiles and posterior intervals derived on their basis are often reported quantities which are easy to estimate from posterior draws. Estimating the efficiency of such quantile estimates thus has a high practical relevance in particular as we observe the efficiency for tail quantiles to often be lower than for the mean or median. The \(\alpha\)-quantile is defined as the parameter value \(\theta_\alpha\) for which \(p(\theta \leq \theta_\alpha) = \alpha\). An estimate \(\hat{\theta}_\alpha\) of \(\theta_\alpha\) can thus be obtained by finding the \(\alpha\)-quantile of the empirical CDF (ECDF) of the posterior draws \(\theta^{(s)}\). However, quantiles cannot be written as an expectation, and thus the above equations for \(\widehat{R}\) and \(S_{\rm eff}\) are not directly applicable. Thus, we first focus on the efficiency estimate for the cumulative probability \(p(\theta \leq \theta_\alpha)\) for different values of \(\theta_\alpha\).
For any \(\theta_\alpha\), the ECDF gives an estimate of the cumulative probability \[ p(\theta \leq \theta_\alpha) \approx \bar{I}_\alpha = \frac{1}{S}\sum_{s=1}^S I(\theta^{(s)} \leq\theta_\alpha), \] where \(I()\) is the indicator function. The indicator function transforms simulation draws to 0’s and 1’s, and thus the subsequent computations are bijectively invariant. Efficiency estimates of the ECDF at any \(\theta_\alpha\) can now be obtained by applying rank-normalizing and subsequent computations directly on the indicator function’s results. See Appendix C for an illustration of variance of ECDF.
Assuming that we know the CDF to be a certain continuous function \(F\) which is smooth near an \(\alpha\)-quantile of interest, we could use the delta method to compute a variance estimate for \(F^{-1}(\bar{I}_\alpha)\). Although we don’t usually know \(F\), the delta method approach reveals that the variance of \(\bar{I}_\alpha\) for some \(\theta_\alpha\) is scaled by the (usually unknown) density \(f(\theta_\alpha)\), but the efficiency depends only on the efficiency of \(\bar{I}_\alpha\). Thus, we can use the effective sample size for the ECDF (we computed using the indicator function \(I(\theta^{(s)} \leq \theta_\alpha)\)) also for the corresponding quantile estimates.
Since the marginal posterior distributions might not have finite mean and variance, by default RStan (Stan Development Team, 2018c) and RStanARM (Stan Development Team, 2018b) report median and median absolute deviation (MAD) instead of mean and standard error (SE). Median and MAD are well defined even when the marginal distribution does not have finite mean and variance. Since the median is just 50%-quantile, we can get efficiency estimate for it as for any other quantile.
We can also compute the efficiency estimate for the median absolute deviation (MAD), by computing the efficiency estimate for the median of absolute deviations from the median of all draws. The absolute deviations from the median of all draws are same as previously defined for folded samples \[ {\rm abs}(\theta^{(nm)}-{\rm median}(\theta)). \] We see that the efficiency estimate for MAD is obtained by using the same approach as for the median (and other quantiles) but with the folded values also used in rank-normalized-folded-split-\(S_{\rm eff}\).
Previously, Stan has reported Monte Carlo standard error estimates for the mean of a quantity. This is valid only if the corresponding marginal distribution has finite mean and variance; and even if valid, it may be easier and more robust to focus on the median and other quantiles, instead.
Median, MAD and quantiles are well defined even when the distribution does not have finite mean and variance, and they are asymptotically normal for continuous distributions which are non-zero in the relevant neighborhood. As the delta method for computing the variance would require explicit knowledge of the normalized posterior density, we propose the following alternative approach:
We can get more local efficiency estimates by considering small probability intervals. We propose to compute the efficiency estimates for \[ \bar{I}_{\alpha,\delta} = p(\hat{Q}_\alpha < \theta \leq \hat{Q}_{\alpha+\delta}), \] where \(\hat{Q}_\alpha\) is an empirical \(\alpha\)-quantile, \(\delta=1/k\) is the length of the interval with some positive integer \(k\), and \(\alpha \in (0,\delta,\ldots,1-\delta)\) changes in steps of \(\delta\). Each interval has \(S/k\) draws, and the efficiency measures autocorrelation of an indicator function which is \(1\) when the values are inside the specific interval and \(0\) otherwise. This gives us a local efficiency measure which does not depend on the shape of the distribution.
In addition to using rank-normalized values to compute split-\(\widehat{R}\), we propose to use rank plots for each chain instead of trace plots. Rank plots are nothing else than histograms of the ranked posterior samples (ranked over all chains) plotted separately for each chain. If rank plots of all chains look similar, this indicates good mixing of the chains. As compared to trace plots, rank plots don’t tend to squeeze to a mess in case of long chains.
The proposal is to switch in Stan:
Justifications for the changes are:
In summary outputs, we propose to use Rhat to denote also the new version. As \(n\) and \(N\) often refer to the number of observations, we propose to use acronym ESS for the effective sample size.
We propose to add to the bayesplot package:
Based on the experiments presented in Appendices D-F, more strict convergence diagnostics and effective sample size warning limits could be used. We propose the following warning thresholds although additional experiments would be useful:
In case of running 4 chains, an effective sample size of 400, corresponds to having an effective sample size of 50 for each 8 split chains, which we consider to be minimum for reliable mean, variance and autocorrelation estimates needed for the convergence diagnostic. We recommend running at least 4 chains to get reliable between chain variances for the convergence diagnostics.
Plots shown in the upcoming sections have dashed lines based on these thresholds (in continuous plots, a dashed line at 1.005 is plotted instead of 1.01, as values larger than that are usually rounded in our summaries to 1.01).
In this section, we will go through some examples to demonstrate the usefulness of our proposed methods as well as the associated workflow in determining convergence. Appendices D-G contain more detailed analysis of different algorithm variants and further examples.
First, we load all the necessary R packages and additional functions.
library(tidyverse)
library(gridExtra)
library(rstan)
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)
library(bayesplot)
theme_set(bayesplot::theme_default(base_family = "sans"))
source('monitornew.R')
source('monitorplot.R')
The classic split-\(\widehat{R}\) is based on calculating within and between chain variances. If the marginal distribution of a chain is such that the variance is not defined (i.e., infinite), the classic split-\(\widehat{R}\) is not well justified. In this section, we will use the Cauchy distribution as an example of such a distribution. Appendix E contains more detailed analysis of different algorithm variants and further Cauchy examples.
The following Cauchy models are from Michael Betancourt’s case study Fitting The Cauchy Distribution
The nominal Cauchy model with direct parameterization is as follows:
writeLines(readLines("cauchy_nom.stan"))
parameters {
vector[50] x;
}
model {
x ~ cauchy(0, 1);
}
generated quantities {
real I = fabs(x[1]) < 1 ? 1 : 0;
}
Run the nominal model:
fit_nom <- stan(file = 'cauchy_nom.stan', seed = 7878, refresh = 0)
Warning: There were 1421 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
Warning: Examine the pairs() plot to diagnose sampling problems
Treedepth exceedence and Bayesian Fraction of Missing Information are dynamic HMC specific diagnostics (Betancourt, 2017). We get warnings about a very large number of transitions after warmup that exceeded the maximum treedepth, which is likely due to very long tails of the Cauchy distribution. All chains have low estimated Bayesian fraction of missing information also indicating slow mixing.
mon <- monitor(fit_nom)
print(mon)
Inference for the input samples (4 chains: each with iter = 2000; warmup = 1000):
Q5 Q50 Q95 SE_Q5 SE_Q50 SE_Q95 Rhat Bulk_ESS Tail_ESS
x[1] -5.87 0.02 6.78 0.94 0.03 1.58 1.01 1972 365
x[2] -5.76 -0.03 5.11 0.72 0.03 0.71 1.00 2411 456
x[3] -6.12 0.00 7.73 1.09 0.03 2.54 1.01 440 268
x[4] -5.30 -0.01 5.41 0.73 0.03 0.66 1.01 2263 585
x[5] -13.14 -0.04 5.83 13.74 0.04 0.86 1.03 167 194
x[6] -8.10 -0.06 6.39 1.80 0.02 0.94 1.01 1603 409
x[7] -7.25 -0.08 7.55 1.28 0.03 1.65 1.01 1220 357
x[8] -4.79 0.02 6.23 0.75 0.02 1.30 1.02 1559 333
x[9] -4.69 0.01 4.82 0.58 0.03 0.61 1.00 2603 606
x[10] -8.43 -0.04 6.50 3.08 0.03 0.90 1.01 1599 246
x[11] -6.21 0.02 7.09 0.62 0.02 1.26 1.01 2458 401
x[12] -5.55 0.02 6.62 0.74 0.03 1.29 1.00 2830 386
x[13] -6.16 0.00 6.13 1.13 0.02 0.92 1.01 2277 365
x[14] -5.45 0.02 6.79 0.85 0.03 0.98 1.01 3118 417
x[15] -8.17 0.05 9.55 1.50 0.03 2.41 1.01 2121 307
x[16] -5.83 0.00 6.23 0.91 0.03 1.14 1.01 2171 449
x[17] -18.10 -0.07 7.39 18.25 0.05 2.81 1.03 207 170
x[18] -5.89 0.05 5.90 0.78 0.02 0.81 1.01 3409 418
x[19] -7.04 0.03 6.23 1.54 0.02 1.14 1.01 1422 401
x[20] -6.58 0.03 7.53 1.74 0.03 1.71 1.02 947 424
x[21] -8.11 0.01 6.57 2.31 0.03 1.05 1.00 1785 377
x[22] -5.47 0.04 5.24 0.76 0.03 0.88 1.00 2836 536
x[23] -10.04 -0.05 6.55 10.12 0.04 1.37 1.04 314 123
x[24] -13.18 -0.07 5.72 9.31 0.03 0.84 1.01 431 216
x[25] -6.89 -0.02 5.60 1.13 0.03 0.65 1.02 1888 272
x[26] -5.33 -0.02 4.69 0.58 0.02 0.49 1.01 3095 453
x[27] -5.19 -0.04 5.17 0.66 0.02 0.66 1.00 3113 473
x[28] -6.81 -0.02 6.81 0.65 0.02 0.87 1.00 4186 528
x[29] -9.96 -0.02 6.99 3.93 0.03 1.00 1.01 926 303
x[30] -6.20 0.02 6.62 1.22 0.03 1.19 1.01 3500 278
x[31] -18.84 -0.04 5.77 22.16 0.04 0.69 1.02 208 130
x[32] -6.33 -0.01 6.21 0.97 0.02 0.86 1.02 3044 452
x[33] -4.81 0.03 4.93 0.45 0.02 0.51 1.01 3504 525
x[34] -5.75 0.02 5.75 0.83 0.02 1.14 1.00 1889 522
x[35] -5.98 0.03 7.16 1.09 0.02 1.26 1.01 2701 453
x[36] -7.37 0.00 9.19 1.37 0.02 2.44 1.01 1404 366
x[37] -5.03 0.02 7.82 0.76 0.02 2.00 1.01 2020 336
x[38] -5.40 -0.01 5.59 0.66 0.02 0.70 1.01 3644 525
x[39] -4.78 -0.02 5.36 0.51 0.02 0.85 1.00 2849 517
x[40] -6.16 0.00 4.58 1.53 0.02 0.53 1.01 1843 462
x[41] -5.46 0.07 6.55 1.02 0.03 0.88 1.01 2322 438
x[42] -7.63 -0.05 6.38 1.22 0.02 0.63 1.01 1786 300
x[43] -7.47 -0.04 6.39 1.76 0.03 0.91 1.01 1871 330
x[44] -5.22 0.00 5.17 0.71 0.02 0.76 1.01 1613 476
x[45] -4.44 0.04 13.98 0.45 0.04 16.35 1.01 295 219
x[46] -5.04 0.01 5.60 0.98 0.02 0.91 1.02 3151 293
x[47] -5.87 0.03 5.46 1.11 0.03 0.89 1.02 2247 508
x[48] -4.99 -0.03 4.91 0.65 0.02 0.63 1.01 2914 499
x[49] -6.42 0.01 10.78 0.73 0.03 8.28 1.02 223 187
x[50] -6.24 0.02 5.71 1.09 0.03 0.66 1.00 2610 499
I 0.00 1.00 1.00 0.00 NA NA 1.00 683 683
lp__ -88.54 -68.47 -50.77 1.36 0.78 0.71 1.01 249 655
For each parameter, Bulk_ESS and Tail_ESS are crude measures of
effective sample size for bulk and tail quantities respectively (good values is
ESS > 400), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).
which_min_ess <- which.min(mon[1:50, 'Bulk_ESS'])
Several Rhat > 1.01 and some ESS < 400 indicate that the results should not be trusted. The Appendix E has more results with longer chains as well.
We can further analyze potential problems using local efficiency and rank plots. We specifically investigate x[5], which has the smallest bulk-ESS 167.
We examine the sampling efficiency in different parts of the posterior by computing the efficiency of small interval probability estimates (see Section Efficiency estimate for small interval probability estimates). Each interval contains \(1/k\) of the draws (e.g., \(5\%\) each, if \(k=20\)). The small interval efficiency measures mixing of an function which indicates when the values are inside or outside the specific small interval. As detailed above, this gives us a local efficiency measure which does not depend on the shape of the distribution.
plot_local_ess(fit = fit_nom, par = which_min_ess, nalpha = 20)
We see that the efficiency of our posterior draws is worryingly low in the tails (which is caused by slow mixing in long tails of Cauchy). Orange ticks show iterations that exceeded the maximum treedepth.
An alternative way to examine the efficiency in different parts of the posterior is to compute efficiencies estimates for quantiles (see Section Efficiency for quantiles). Each interval has a specified proportion of draws, and the efficiency measures mixing of a function which indicates when the values are smaller than or equal to the corresponding quantile.
plot_quantile_ess(fit = fit_nom, par = which_min_ess, nalpha = 40)
Similar as above, we see that the efficiency of our posterior draws is worryingly low in the tails. Again, orange ticks show iterations that exceeded the maximum treedepth.
We may also investigate how the estimated effective sample sizes change when we use more and more draws. If the effective sample size is highly unstable, does not increase proportionally with more draws, or even decreases, this indicates that simply running longer chains will likely not solve the convergence issues. In the plot below, we see how unstable both bulk-ESS and tail-ESS are for this example.
plot_change_ess(fit = fit_nom, par = which_min_ess)
We can further analyze potential problems using rank plots in which we clearly see differences between chains.
samp <- as.array(fit_nom)
xmin <- paste0("x[", which_min_ess, "]")
mcmc_hist_r_scale(samp[, , xmin])
Next we examine an alternative parameterization that considers the Cauchy distribution as a scale mixture of Gaussian distributions. The model has two parameters and the Cauchy distributed \(x\)’s can be computed from those. In addition to improved sampling performance, the example illustrates that focusing on diagnostics matters.
writeLines(readLines("cauchy_alt_1.stan"))
parameters {
vector[50] x_a;
vector<lower=0>[50] x_b;
}
transformed parameters {
vector[50] x = x_a ./ sqrt(x_b);
}
model {
x_a ~ normal(0, 1);
x_b ~ gamma(0.5, 0.5);
}
generated quantities {
real I = fabs(x[1]) < 1 ? 1 : 0;
}
Run the alternative model:
fit_alt1 <- stan(file = 'cauchy_alt_1.stan', seed = 7878, refresh = 0)
There are no warnings, and the sampling is much faster.
mon <- monitor(fit_alt1)
print(mon)
Inference for the input samples (4 chains: each with iter = 2000; warmup = 1000):
Q5 Q50 Q95 SE_Q5 SE_Q50 SE_Q95 Rhat Bulk_ESS Tail_ESS
x_a[1] -1.68 -0.03 1.60 0.04 0.02 0.03 1 4260 2212
x_a[2] -1.59 -0.01 1.71 0.04 0.02 0.04 1 3650 2361
x_a[3] -1.62 0.00 1.61 0.05 0.02 0.03 1 4090 2409
x_a[4] -1.66 -0.02 1.58 0.03 0.02 0.05 1 4145 2464
x_a[5] -1.69 -0.01 1.72 0.03 0.02 0.04 1 4573 2190
x_a[6] -1.67 0.01 1.71 0.05 0.02 0.04 1 4280 2165
x_a[7] -1.69 0.03 1.71 0.04 0.02 0.04 1 4071 2208
x_a[8] -1.61 0.01 1.64 0.04 0.02 0.03 1 3998 2417
x_a[9] -1.71 0.02 1.64 0.04 0.02 0.04 1 3866 2292
x_a[10] -1.63 -0.01 1.60 0.04 0.02 0.04 1 3981 2265
x_a[11] -1.60 -0.04 1.61 0.04 0.02 0.04 1 4217 2585
x_a[12] -1.64 0.01 1.64 0.04 0.02 0.03 1 3669 2309
x_a[13] -1.70 -0.01 1.68 0.05 0.02 0.03 1 3886 2230
x_a[14] -1.69 0.03 1.66 0.04 0.02 0.03 1 3963 2477
x_a[15] -1.63 -0.01 1.67 0.03 0.02 0.03 1 3741 2361
x_a[16] -1.65 -0.01 1.65 0.04 0.02 0.03 1 4344 2400
x_a[17] -1.66 -0.02 1.67 0.04 0.02 0.04 1 4345 2301
x_a[18] -1.56 0.01 1.55 0.06 0.02 0.04 1 3917 2254
x_a[19] -1.61 -0.02 1.65 0.05 0.02 0.03 1 4046 2475
x_a[20] -1.64 0.01 1.72 0.03 0.03 0.04 1 4215 2178
x_a[21] -1.62 -0.01 1.58 0.05 0.02 0.04 1 4314 2338
x_a[22] -1.61 0.00 1.61 0.04 0.02 0.04 1 3891 2256
x_a[23] -1.69 -0.02 1.62 0.04 0.02 0.04 1 3873 2213
x_a[24] -1.61 -0.02 1.61 0.05 0.02 0.04 1 3485 2398
x_a[25] -1.65 0.01 1.64 0.04 0.01 0.04 1 4311 2283
x_a[26] -1.60 0.05 1.64 0.03 0.02 0.04 1 3960 2316
x_a[27] -1.64 -0.02 1.66 0.05 0.02 0.05 1 3584 2384
x_a[28] -1.62 0.01 1.59 0.03 0.02 0.03 1 3899 2299
x_a[29] -1.60 0.02 1.58 0.04 0.02 0.03 1 4387 2336
x_a[30] -1.65 0.02 1.70 0.04 0.02 0.06 1 4316 2086
x_a[31] -1.64 0.00 1.64 0.05 0.02 0.04 1 3863 2447
x_a[32] -1.67 -0.01 1.65 0.04 0.02 0.03 1 4433 2006
x_a[33] -1.65 -0.02 1.59 0.04 0.01 0.05 1 4236 2199
x_a[34] -1.67 0.02 1.69 0.04 0.02 0.04 1 4170 2356
x_a[35] -1.62 0.00 1.58 0.03 0.02 0.04 1 3751 2261
x_a[36] -1.68 0.01 1.65 0.04 0.02 0.03 1 3896 2472
x_a[37] -1.60 0.03 1.66 0.04 0.02 0.04 1 4487 2170
x_a[38] -1.69 0.01 1.72 0.03 0.02 0.03 1 4030 2117
x_a[39] -1.64 -0.01 1.66 0.04 0.02 0.04 1 4407 2290
x_a[40] -1.69 -0.02 1.60 0.03 0.03 0.04 1 4144 2466
x_a[41] -1.64 0.02 1.65 0.04 0.02 0.03 1 4252 2232
x_a[42] -1.65 -0.02 1.61 0.03 0.03 0.03 1 3663 2191
x_a[43] -1.62 -0.02 1.64 0.05 0.02 0.03 1 3836 2350
x_a[44] -1.65 0.02 1.76 0.02 0.02 0.05 1 3758 2122
x_a[45] -1.69 0.04 1.71 0.04 0.03 0.03 1 3855 2361
x_a[46] -1.67 -0.01 1.70 0.04 0.02 0.05 1 4204 2216
x_a[47] -1.62 -0.03 1.65 0.04 0.02 0.03 1 3967 2183
x_a[48] -1.64 0.01 1.67 0.04 0.02 0.03 1 4374 2427
x_a[49] -1.64 0.02 1.64 0.03 0.02 0.03 1 3825 2414
x_a[50] -1.65 -0.01 1.63 0.04 0.02 0.05 1 3746 2027
x_b[1] 0.00 0.41 3.83 0.00 0.01 0.10 1 2374 4544
x_b[2] 0.00 0.47 3.73 0.00 0.02 0.10 1 2243 4851
x_b[3] 0.00 0.47 3.76 0.00 0.02 0.09 1 2860 3941
x_b[4] 0.01 0.49 3.95 0.00 0.02 0.14 1 2753 4043
x_b[5] 0.00 0.49 3.89 0.00 0.02 0.14 1 3378 3523
x_b[6] 0.00 0.46 3.81 0.00 0.02 0.09 1 2619 4137
x_b[7] 0.01 0.47 4.07 0.00 0.02 0.17 1 2831 3913
x_b[8] 0.01 0.45 3.75 0.00 0.02 0.13 1 2560 4203
x_b[9] 0.00 0.44 3.83 0.00 0.02 0.12 1 2785 4907
x_b[10] 0.00 0.48 3.78 0.00 0.02 0.10 1 2691 4165
x_b[11] 0.00 0.45 3.87 0.00 0.02 0.14 1 2809 4034
x_b[12] 0.00 0.42 3.94 0.00 0.02 0.15 1 2549 4081
x_b[13] 0.00 0.45 3.94 0.00 0.02 0.11 1 2504 4336
x_b[14] 0.00 0.44 3.71 0.00 0.02 0.10 1 2963 4096
x_b[15] 0.00 0.43 3.98 0.00 0.02 0.08 1 2545 4977
x_b[16] 0.01 0.49 3.62 0.00 0.02 0.10 1 2419 4123
x_b[17] 0.00 0.47 3.86 0.00 0.02 0.13 1 2424 3904
x_b[18] 0.00 0.46 3.79 0.00 0.02 0.16 1 2174 4232
x_b[19] 0.00 0.44 3.81 0.00 0.02 0.11 1 2982 4400
x_b[20] 0.00 0.48 3.61 0.00 0.02 0.11 1 3157 4003
x_b[21] 0.00 0.47 3.74 0.00 0.02 0.13 1 2889 4321
x_b[22] 0.00 0.45 3.85 0.00 0.02 0.10 1 2439 4235
x_b[23] 0.00 0.46 4.02 0.00 0.02 0.11 1 2825 3846
x_b[24] 0.00 0.46 3.79 0.00 0.02 0.14 1 2549 4138
x_b[25] 0.00 0.49 3.84 0.00 0.02 0.14 1 2510 4145
x_b[26] 0.00 0.44 4.08 0.00 0.02 0.13 1 3282 4066
x_b[27] 0.00 0.43 3.75 0.00 0.02 0.13 1 2441 4471
x_b[28] 0.00 0.45 3.77 0.00 0.02 0.11 1 2631 3893
x_b[29] 0.00 0.42 3.80 0.00 0.02 0.13 1 2566 4293
x_b[30] 0.00 0.45 3.82 0.00 0.02 0.15 1 2520 4099
x_b[31] 0.00 0.43 3.84 0.00 0.02 0.13 1 2811 4075
x_b[32] 0.00 0.45 3.75 0.00 0.02 0.14 1 2410 4279
x_b[33] 0.01 0.46 3.81 0.00 0.02 0.14 1 2323 4604
x_b[34] 0.00 0.46 3.89 0.00 0.02 0.12 1 2860 4064
x_b[35] 0.00 0.41 3.71 0.00 0.02 0.15 1 2534 4481
x_b[36] 0.00 0.44 3.60 0.00 0.02 0.13 1 2480 5072
x_b[37] 0.00 0.47 3.82 0.00 0.02 0.11 1 2791 4225
x_b[38] 0.00 0.47 4.12 0.00 0.02 0.16 1 2643 3433
x_b[39] 0.00 0.45 3.96 0.00 0.02 0.11 1 2592 4300
x_b[40] 0.00 0.45 3.68 0.00 0.02 0.15 1 2457 4171
x_b[41] 0.00 0.48 3.76 0.00 0.02 0.16 1 2768 4402
x_b[42] 0.00 0.46 3.95 0.00 0.03 0.10 1 2261 4438
x_b[43] 0.00 0.48 3.85 0.00 0.02 0.11 1 2516 4121
x_b[44] 0.00 0.46 3.50 0.00 0.02 0.11 1 2702 3998
x_b[45] 0.00 0.45 4.01 0.00 0.02 0.09 1 2873 4339
x_b[46] 0.00 0.45 3.94 0.00 0.03 0.10 1 2482 4039
x_b[47] 0.00 0.49 3.75 0.00 0.02 0.14 1 2541 3904
x_b[48] 0.00 0.46 3.79 0.00 0.02 0.11 1 2345 4076
x_b[49] 0.00 0.43 3.85 0.00 0.02 0.15 1 2497 4386
x_b[50] 0.00 0.46 3.78 0.00 0.02 0.13 1 2774 4171
x[1] -7.64 -0.04 7.33 0.53 0.03 0.95 1 3894 2029
x[2] -6.02 -0.01 6.41 0.57 0.02 0.58 1 3522 1691
x[3] -6.54 0.00 6.33 0.48 0.02 0.62 1 3496 2302
x[4] -5.52 -0.02 5.74 0.44 0.03 0.35 1 3774 2057
x[5] -6.20 -0.01 6.29 0.66 0.02 0.41 1 4132 2478
x[6] -5.55 0.01 6.40 0.44 0.03 0.59 1 4172 2123
x[7] -5.73 0.05 6.63 0.38 0.03 0.61 1 3751 2144
x[8] -5.80 0.01 5.58 0.53 0.03 0.52 1 3654 1991
x[9] -5.96 0.03 6.01 0.53 0.02 0.38 1 3568 2126
x[10] -5.73 -0.01 6.11 0.60 0.02 0.63 1 3612 2183
x[11] -6.35 -0.05 6.08 0.49 0.03 0.55 1 3932 2355
x[12] -6.59 0.02 5.94 0.52 0.03 0.44 1 3390 2008
x[13] -6.36 -0.01 6.24 0.50 0.02 0.51 1 3898 2061
x[14] -6.64 0.04 6.51 0.59 0.03 0.57 1 3706 2446
x[15] -7.28 -0.01 5.90 0.89 0.03 0.60 1 3695 1821
x[16] -5.65 -0.01 5.28 0.52 0.03 0.46 1 3940 1944
x[17] -6.10 -0.03 6.17 0.53 0.02 0.48 1 3999 2387
x[18] -5.74 0.02 6.79 0.56 0.02 0.73 1 3866 1738
x[19] -6.01 -0.03 6.04 0.71 0.02 0.48 1 3802 2497
x[20] -5.56 0.01 6.22 0.54 0.03 0.74 1 3875 2085
x[21] -6.21 -0.01 6.22 0.69 0.02 0.75 1 3907 2228
x[22] -6.23 0.00 6.02 0.82 0.02 0.53 1 3701 1981
x[23] -6.17 -0.02 6.09 0.61 0.02 0.52 1 3893 1972
x[24] -6.25 -0.02 5.56 0.77 0.03 0.49 1 3133 2388
x[25] -6.72 0.02 5.47 0.45 0.02 0.38 1 3898 2286
x[26] -5.61 0.06 5.81 0.30 0.03 0.33 1 3876 2391
x[27] -7.91 -0.03 6.89 0.86 0.03 0.62 1 3694 2015
x[28] -6.29 0.01 6.80 0.40 0.03 0.71 1 3827 2058
x[29] -6.57 0.02 6.30 0.62 0.03 0.68 1 3796 2019
x[30] -5.92 0.03 6.31 0.51 0.03 0.63 1 3879 2023
x[31] -5.80 0.00 6.28 0.51 0.02 0.52 1 3728 2406
x[32] -6.10 -0.02 6.39 0.52 0.02 0.77 1 4410 1851
x[33] -6.00 -0.03 5.39 0.63 0.02 0.36 1 3641 1928
x[34] -5.87 0.04 6.56 0.48 0.03 0.37 1 3854 2455
x[35] -6.81 0.00 6.06 0.56 0.03 0.46 1 3462 2091
x[36] -6.12 0.01 6.09 0.52 0.02 0.51 1 3740 2146
x[37] -5.92 0.03 6.50 0.49 0.02 0.50 1 4235 2157
x[38] -6.46 0.01 6.00 0.86 0.03 0.58 1 3707 2089
x[39] -6.79 -0.01 6.27 0.73 0.02 0.69 1 4049 1971
x[40] -6.97 -0.02 5.62 0.68 0.03 0.52 1 3775 2030
x[41] -6.12 0.02 5.98 0.57 0.02 0.41 1 3947 2179
x[42] -7.48 -0.03 6.42 0.70 0.03 0.68 1 3643 1713
x[43] -5.99 -0.02 6.57 0.58 0.02 0.48 1 3563 2322
x[44] -6.09 0.03 5.93 0.58 0.03 0.47 1 3772 2198
x[45] -6.25 0.04 6.65 0.45 0.03 0.44 1 3756 2397
x[46] -5.88 -0.02 7.59 0.39 0.02 0.81 1 3575 2043
x[47] -6.18 -0.03 5.82 0.59 0.03 0.48 1 3777 2170
x[48] -6.61 0.01 6.03 0.64 0.02 0.47 1 3840 2008
x[49] -5.88 0.02 6.52 0.40 0.03 0.75 1 3834 1875
x[50] -7.08 0.00 6.32 0.63 0.02 0.45 1 3522 2128
I 0.00 0.00 1.00 0.00 0.00 NA 1 2926 2926
lp__ -95.29 -81.24 -69.52 0.33 0.22 0.33 1 1517 2931
For each parameter, Bulk_ESS and Tail_ESS are crude measures of
effective sample size for bulk and tail quantities respectively (good values is
ESS > 400), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).
which_min_ess <- which.min(mon[101:150, 'Bulk_ESS'])
All Rhat < 1.01 and ESS > 400 indicate the sampling worked much better with the alternative parameterization. Appendix E has more results using other alternative parameterizations.
We can further analyze potential problems using local efficiency estimates and rank plots. We take a detailed look at x[24], which has the smallest bulk-ESS of 3485.
We examine the sampling efficiency in different parts of the posterior by computing the efficiency estimates for small interval probability estimates.
plot_local_ess(fit = fit_alt1, par = which_min_ess + 100, nalpha = 20)
The efficiency estimate is good in all parts of the posterior. Further, we examine the sampling efficiency of different quantile estimates.
plot_quantile_ess(fit = fit_alt1, par = which_min_ess + 100, nalpha = 40)
Rank plots also look rather similar across chains.
samp <- as.array(fit_alt1)
xmin <- paste0("x[", which_min_ess, "]")
mcmc_hist_r_scale(samp[, , xmin])
In summary, the alternative parameterization produces results that look much better than for the nominal parameterization. There are still some differences in the tails, which we also identified via the tail-ESS.
Half-Cauchy priors are common and, for example, in Stan usually set using the nominal parameterization. However, when the constraint <lower=0> is used, Stan does the sampling automatically in the unconstrained log(x) space, which changes the geometry crucially.
writeLines(readLines("half_cauchy_nom.stan"))
parameters {
vector<lower=0>[50] x;
}
model {
x ~ cauchy(0, 1);
}
generated quantities {
real I = fabs(x[1]) < 1 ? 1 : 0;
}
Run the half-Cauchy with nominal parameterization (and positive constraint):
fit_half_nom <- stan(file = 'half_cauchy_nom.stan', seed = 7878, refresh = 0)
There are no warnings, and the sampling is much faster than for the Cauchy nominal model.
mon <- monitor(fit_half_nom)
print(mon)
Inference for the input samples (4 chains: each with iter = 2000; warmup = 1000):
Q5 Q50 Q95 SE_Q5 SE_Q50 SE_Q95 Rhat Bulk_ESS Tail_ESS
x[1] 0.08 1.01 14.11 0.01 0.02 1.35 1.00 6705 1696
x[2] 0.08 0.99 13.81 0.01 0.03 1.65 1.00 6019 1390
x[3] 0.07 0.97 13.53 0.01 0.02 0.73 1.00 7470 1829
x[4] 0.09 0.99 11.20 0.01 0.02 0.94 1.00 8348 1729
x[5] 0.07 1.01 14.51 0.01 0.02 1.10 1.00 7567 1622
x[6] 0.08 1.02 13.91 0.01 0.02 1.28 1.00 7493 1693
x[7] 0.08 0.99 13.46 0.01 0.02 1.02 1.00 7712 1731
x[8] 0.09 1.00 11.29 0.01 0.02 1.08 1.00 9379 1611
x[9] 0.07 0.99 13.40 0.01 0.03 1.40 1.00 7531 1510
x[10] 0.08 1.02 12.53 0.01 0.02 1.35 1.00 7457 1456
x[11] 0.09 1.00 10.67 0.01 0.02 0.74 1.01 7357 1738
x[12] 0.08 1.01 12.34 0.01 0.02 1.34 1.00 8039 1581
x[13] 0.08 1.01 11.84 0.01 0.02 0.88 1.00 7175 1751
x[14] 0.08 1.01 11.97 0.01 0.02 0.96 1.00 8532 1505
x[15] 0.08 0.98 13.84 0.01 0.02 0.90 1.00 8956 1355
x[16] 0.08 0.98 11.59 0.01 0.02 1.17 1.00 7819 1698
x[17] 0.08 0.97 11.95 0.00 0.02 1.04 1.00 6986 1899
x[18] 0.08 1.00 10.98 0.01 0.02 0.91 1.00 6999 1537
x[19] 0.08 1.03 12.08 0.01 0.02 1.11 1.00 7613 1759
x[20] 0.07 1.00 14.24 0.01 0.02 1.12 1.00 8368 1674
x[21] 0.08 1.01 11.88 0.01 0.01 0.66 1.00 8162 1724
x[22] 0.09 1.02 11.18 0.01 0.02 1.33 1.00 7670 1780
x[23] 0.07 1.00 14.22 0.01 0.02 2.04 1.00 7419 1677
x[24] 0.07 0.99 12.30 0.01 0.02 1.01 1.00 7346 1797
x[25] 0.09 1.01 11.29 0.01 0.02 1.04 1.00 8103 1720
x[26] 0.07 1.05 13.90 0.01 0.02 1.57 1.00 7079 1776
x[27] 0.07 0.99 14.12 0.01 0.01 1.44 1.00 8381 1580
x[28] 0.07 1.00 15.56 0.01 0.02 1.43 1.00 8576 1744
x[29] 0.08 1.00 13.28 0.01 0.02 1.37 1.00 7709 1903
x[30] 0.06 0.99 14.14 0.01 0.02 1.25 1.00 6518 1578
x[31] 0.08 1.04 16.07 0.01 0.02 1.71 1.00 7130 1677
x[32] 0.09 1.01 13.23 0.01 0.02 1.26 1.00 7794 1824
x[33] 0.08 1.00 12.96 0.01 0.02 0.92 1.00 7319 1691
x[34] 0.09 0.97 11.01 0.01 0.02 0.77 1.00 7419 1738
x[35] 0.07 1.00 13.56 0.01 0.02 1.19 1.00 6842 1618
x[36] 0.07 1.01 13.18 0.01 0.02 1.43 1.00 7526 1513
x[37] 0.09 1.03 11.61 0.01 0.02 0.81 1.00 7229 1729
x[38] 0.08 0.99 15.73 0.01 0.02 1.55 1.00 8059 1421
x[39] 0.07 1.00 13.93 0.01 0.02 1.31 1.00 8216 1684
x[40] 0.07 1.03 12.88 0.01 0.02 1.14 1.00 7306 1455
x[41] 0.07 0.97 13.42 0.01 0.02 1.44 1.00 7207 1644
x[42] 0.09 0.99 11.94 0.01 0.02 1.00 1.00 7528 1635
x[43] 0.09 1.01 12.12 0.01 0.02 0.84 1.00 7381 1931
x[44] 0.08 1.02 12.66 0.01 0.02 1.34 1.00 6487 1704
x[45] 0.07 1.00 13.25 0.01 0.02 1.18 1.00 6961 1809
x[46] 0.07 1.03 14.46 0.01 0.02 1.14 1.00 7899 1491
x[47] 0.08 1.00 13.13 0.01 0.02 1.14 1.00 6569 1958
x[48] 0.07 1.02 12.49 0.01 0.02 0.87 1.00 7776 1715
x[49] 0.08 1.00 12.68 0.01 0.02 1.29 1.00 7916 1651
x[50] 0.10 0.99 11.95 0.01 0.02 1.38 1.00 8105 1615
I 0.00 0.00 1.00 0.00 0.50 NA 1.00 5825 5825
lp__ -80.63 -69.45 -59.66 0.29 0.23 0.25 1.00 1141 2144
For each parameter, Bulk_ESS and Tail_ESS are crude measures of
effective sample size for bulk and tail quantities respectively (good values is
ESS > 400), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).
All Rhat < 1.01 and ESS > 400 indicate good performance of the sampler. We see that the Stan’s automatic (and implicit) transformation of constraint parameters can have a big effect on the sampling performance. More experiments with different parameterizations of the half-Cauchy distribution can be found in Appendix E.
The Eight Schools data is a classic example for hierarchical models (see Section 5.5 in Gelman et al., 2013), which despite the apparent simplicity nicely illustrates the typical problems in inference for hierarchical models. The Stan models below are from Michael Betancourt’s case study on Diagnosing Biased Inference with Divergences. Appendix F contains more detailed analysis of different algorithm variants.
writeLines(readLines("eight_schools_cp.stan"))
data {
int<lower=0> J;
real y[J];
real<lower=0> sigma[J];
}
parameters {
real mu;
real<lower=0> tau;
real theta[J];
}
model {
mu ~ normal(0, 5);
tau ~ cauchy(0, 5);
theta ~ normal(mu, tau);
y ~ normal(theta, sigma);
}
We directly run the centered parameterization model with an increased adapt_delta value to reduce the probability of getting divergent transitions.
eight_schools <- read_rdump("eight_schools.data.R")
fit_cp <- stan(
file = 'eight_schools_cp.stan', data = eight_schools,
iter = 2000, chains = 4, seed = 483892929, refresh = 0,
control = list(adapt_delta = 0.95)
)
Warning: There were 28 divergent transitions after warmup. Increasing adapt_delta above 0.95 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
Warning: Examine the pairs() plot to diagnose sampling problems
Despite an increased adapt_delta, we still observe a lot of divergent transitions, which in itself is already sufficient indicator to not trust the results. We can use Rhat and ESS diagnostics to recognize problematic parts of the posterior and they could be used in cases when other MCMC algorithms than HMC is used.
mon <- monitor(fit_cp)
print(mon)
Inference for the input samples (4 chains: each with iter = 2000; warmup = 1000):
Q5 Q50 Q95 SE_Q5 SE_Q50 SE_Q95 Rhat Bulk_ESS Tail_ESS
mu -0.93 4.53 9.85 0.17 0.13 0.24 1.01 695 811
tau 0.71 3.02 9.79 0.07 0.17 0.31 1.02 218 1040
theta[1] -1.24 5.88 16.01 0.21 0.20 0.51 1.01 991 823
theta[2] -2.44 4.99 12.84 0.24 0.14 0.24 1.00 1247 1115
theta[3] -4.99 4.32 12.13 0.47 0.21 0.27 1.00 1168 940
theta[4] -2.91 4.88 12.73 0.23 0.21 0.27 1.00 1024 883
theta[5] -3.99 3.95 10.80 0.27 0.16 0.18 1.00 951 1100
theta[6] -3.97 4.38 11.45 0.37 0.16 0.23 1.01 1196 989
theta[7] -0.91 6.04 15.06 0.23 0.21 0.49 1.00 936 825
theta[8] -3.08 4.86 13.78 0.34 0.19 0.38 1.00 1287 897
lp__ -24.50 -15.31 -4.61 0.33 0.46 0.78 1.02 210 450
For each parameter, Bulk_ESS and Tail_ESS are crude measures of
effective sample size for bulk and tail quantities respectively (good values is
ESS > 400), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).
See Appendix F for results of longer chains.
Bulk-ESS for the between school standard deviation tau is 0.05<0.01, indicating we should investigate that parameter more carefully. We thus examine the sampling efficiency in different parts of the posterior by computing the efficiency estimate for small interval estimates for tau. These plots may either show quantiles or parameter values at the vertical axis. Red ticks show divergent transitions.
plot_local_ess(fit = fit_cp, par = "tau", nalpha = 20)
plot_local_ess(fit = fit_cp, par = "tau", nalpha = 20, rank = FALSE)
We see that the sampler has difficulties in exploring small tau values. As the sampling efficiency for estimating small tau values is practically zero, we may assume that we may miss substantial amount of posterior mass and get biased estimates. Red ticks, which show iterations with divergences, have concentrated to small tau values, indicate also problems exploring small values which is likely to cause bias.
We examine also the sampling efficiency of different quantile estimates. Again, these plots may either show quantiles or parameter values at the vertical axis.
plot_quantile_ess(fit = fit_cp, par = "tau", nalpha = 40)
plot_quantile_ess(fit = fit_cp, par = "tau", nalpha = 40, rank = FALSE)
Most of the quantile estimates have worryingly low effective sample size.
Let’s see how the estimated effective sample size changes when we use more and more draws. Here we don’t see sudden changes, but bulk-ESS is too low for reliable convergence diagnostic. See Appendix F for results of longer chains.
plot_change_ess(fit = fit_cp, par = "tau")
In lines with these findings, the rank plots of tau clearly show problems in the mixing of the chains.
samp_cp <- as.array(fit_cp)
mcmc_hist_r_scale(samp_cp[, , "tau"])
For hierarchical models, the non-centered parameterization often works better than the centered one:
writeLines(readLines("eight_schools_ncp.stan"))
data {
int<lower=0> J;
real y[J];
real<lower=0> sigma[J];
}
parameters {
real mu;
real<lower=0> tau;
real theta_tilde[J];
}
transformed parameters {
real theta[J];
for (j in 1:J)
theta[j] = mu + tau * theta_tilde[j];
}
model {
mu ~ normal(0, 5);
tau ~ cauchy(0, 5);
theta_tilde ~ normal(0, 1);
y ~ normal(theta, sigma);
}
For reasons of comparability, we also run the non-centered parameterization model with an increased adapt_delta value:
fit_ncp2 <- stan(
file = 'eight_schools_ncp.stan', data = eight_schools,
iter = 2000, chains = 4, seed = 483892929, refresh = 0,
control = list(adapt_delta = 0.95)
)
We get zero divergences and no other warnings which is a first good sign.
mon <- monitor(fit_ncp2)
print(mon)
Inference for the input samples (4 chains: each with iter = 2000; warmup = 1000):
Q5 Q50 Q95 SE_Q5 SE_Q50 SE_Q95 Rhat Bulk_ESS Tail_ESS
mu -1.00 4.41 9.80 0.08 0.06 0.11 1 5201 2100
tau 0.24 2.73 9.46 0.02 0.07 0.21 1 2669 3073
theta_tilde[1] -1.35 0.30 1.96 0.04 0.01 0.04 1 5593 1980
theta_tilde[2] -1.45 0.11 1.63 0.04 0.02 0.04 1 5607 2044
theta_tilde[3] -1.70 -0.09 1.60 0.04 0.02 0.04 1 5602 1787
theta_tilde[4] -1.44 0.09 1.63 0.04 0.02 0.03 1 5810 2060
theta_tilde[5] -1.66 -0.17 1.39 0.04 0.02 0.03 1 5022 1958
theta_tilde[6] -1.60 -0.07 1.54 0.04 0.02 0.04 1 5057 1802
theta_tilde[7] -1.31 0.34 1.88 0.05 0.02 0.03 1 4562 2005
theta_tilde[8] -1.50 0.07 1.66 0.03 0.02 0.03 1 5754 1992
theta[1] -1.78 5.66 16.24 0.20 0.08 0.46 1 4396 2697
theta[2] -2.33 4.83 12.35 0.17 0.07 0.19 1 5315 2545
theta[3] -5.07 4.17 11.92 0.26 0.08 0.19 1 4351 2323
theta[4] -2.77 4.76 12.52 0.23 0.07 0.21 1 5822 2492
theta[5] -4.41 3.85 10.59 0.20 0.06 0.11 1 4822 2555
theta[6] -3.82 4.28 11.45 0.23 0.07 0.16 1 4864 2473
theta[7] -1.18 5.86 15.20 0.15 0.06 0.26 1 4804 2839
theta[8] -3.36 4.78 12.93 0.25 0.09 0.39 1 5097 2479
lp__ -11.20 -6.67 -3.76 0.16 0.06 0.06 1 1485 2748
For each parameter, Bulk_ESS and Tail_ESS are crude measures of
effective sample size for bulk and tail quantities respectively (good values is
ESS > 400), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).
All Rhat < 1.01 and ESS > 400 indicate a much better performance of the non-centered parameterization.
We examine the sampling efficiency in different parts of the posterior by computing the effective sample size for small interval probability estimates for tau.
plot_local_ess(fit = fit_ncp2, par = 2, nalpha = 20)
Small tau values are still more difficult to explore, but the relative efficiency is in a good range. We may also check this with a finer resolution:
plot_local_ess(fit = fit_ncp2, par = 2, nalpha = 50)
The sampling efficiency for different quantile estimates looks good as well.
plot_quantile_ess(fit = fit_ncp2, par = 2, nalpha = 40)
In line with these findings, the rank plots of tau show no substantial differences between chains.
samp_ncp2 <- as.array(fit_ncp2)
mcmc_hist_r_scale(samp_ncp2[, , 2])
Betancourt, M. (2017) ‘A conceptual introduction to hamiltonian monte carlo’, arXiv preprint arXiv:1701.02434.
Brooks, S. P. and Gelman, A. (1998) ‘General methods for monitoring convergence of iterative simulations’, Journal of Computational and Graphical Statistics, 7(4), pp. 434–455.
Gelman, A. et al. (2013) Bayesian data analysis, third edition. CRC Press.
Gelman, A. and Rubin, D. B. (1992) ‘Inference from iterative simulation using multiple sequences’, Statistical science, 7(4), pp. 457–472.
Geyer, C. J. (1992) ‘Practical Markov chain Monte Carlo’, Statistical Science, 7, pp. 473–483.
Geyer, C. J. (2011) ‘Introduction to Markov chain Monte Carlo’, in Brooks, S. et al. (eds) Handbook of markov chain monte carlo. CRC Press.
Hoffman, M. D. and Gelman, A. (2014) ‘The No-U-Turn Sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo’, Journal of Machine Learning Research, 15, pp. 1593–1623. Available at: http://jmlr.org/papers/v15/hoffman14a.html.
Stan Development Team (2018a) Bayesian statistics using stan. Stan Development Team. Available at: https://github.com/stan-dev/stan-book.
Stan Development Team (2018b) ‘RStanArm: Bayesian applied regression modeling via Stan. R package version 2.17.4’. Available at: http://mc-stan.org.
Stan Development Team (2018c) ‘RStan: The R interface to Stan. R package version 2.17.3’. Available at: http://mc-stan.org.
Stan Development Team (2018d) ‘Stan modeling language users guide and reference manual. Version 2.18.0’. Available at: http://mc-stan.org.
Stan Development Team (2018e) ‘The Stan core library version 2.18.0’. Available at: http://mc-stan.org.
The following abbreviations are used throughout the appendices:
We will illustrate the rank normalization with a few examples. First, we plot histograms, and empirical cumulative distribution functions (ECDF) with respect to the original parameter values (\(\theta\)), scaled ranks (ranks divided by the maximum rank), and rank normalized values (z). We used scaled ranks to make the plots look similar for different number of draws.
100 draws from Normal(0, 1):
n <- 100
theta <- rnorm(n)
plot_ranknorm(theta, n)
100 draws from Exponential(1):
theta <- rexp(n)
plot_ranknorm(theta, n)
100 draws from Cauchy(0, 1):
theta <- rcauchy(n)
plot_ranknorm(theta, n)
In the above plots, the ECDF with respect to scaled rank and rank normalized \(z\)-values look exactly the same for all distributions. In Split-\(\widehat{R}\) and effective sample size computations, we rank all draws jointly, but then compare ranks and ECDF of individual split chains. To illustrate the variation between chains, we draw 8 batches of 100 draws each from Normal(0, 1):
n <- 100
m <- 8
theta <- rnorm(n * m)
plot_ranknorm(theta, n, m)
The variation in ECDF due to the variation ranks is now visible also in scaled ranks and rank normalized \(z\)-values from different batches (chains).
The benefit of rank normalization is more obvious for non-normal distribution such as Cauchy:
theta <- rcauchy(n * m)
plot_ranknorm(theta, n, m)
Rank normalization makes the subsequent computations well defined and invariant under bijective transformations. This means that we get the same results, for example, if we use unconstrained or constrained parameterisations in a model.
In Section 3, we had defined the empirical CDF (ECDF) for any \(\theta_\alpha\) as \[ p(\theta \leq \theta_\alpha) \approx \bar{I}_\alpha = \frac{1}{S}\sum_{s=1}^S I(\theta^{(s)} \leq\theta_\alpha), \]
For independent draws, \(\bar{I}_\alpha\) has a \({\rm Beta}(\bar{I}_\alpha+1, S - \bar{I}_\alpha + 1)\) distribution. Thus we can easily examine the variation of the ECDF for any \(\theta_\alpha\) value from a single chain. If \(\bar{I}_\alpha\) is not very close to \(1\) or \(S\) and \(S\) is large, we can use the variance of Beta distribution
\[ {\rm Var}[p(\theta \leq \theta_\alpha)] = \frac{(\bar{I}_\alpha+1)*(S-\bar{I}_\alpha+1)}{(S+2)^2(S+3)}. \] We illustrate uncertainty intervals of the Beta distribution and normal approximation of ECDF for 100 draws from Normal(0, 1):
n <- 100
m <- 1
theta <- rnorm(n * m)
plot_ranknorm(theta, n, m, interval = TRUE)
Uncertainty intervals of ECDF for draws from Cauchy(0, 1) illustrate again the improved visual clarity in plotting when using scaled ranks:
n <- 100
m <- 1
theta <- rcauchy(n * m)
plot_ranknorm(theta, n, m, interval = TRUE)
The above plots illustrate that the normal approximation is accurate for practical purposes in MCMC diagnostics.
This part focuses on diagnostics for
To simplify, in this part, independent draws are used as a proxy for very efficient MCMC sampling. First, we sample draws from a standard-normal distribution. We will discuss the behavior for non-normal distributions later. See Appendix A for the abbreviations used.
All chains are from the same Normal(0, 1) distribution plus a linear trend added to all chains:
conds <- expand.grid(
iters = c(250, 1000, 4000),
trend = c(0, 0.25, 0.5, 0.75, 1),
rep = 1:10
)
res <- vector("list", nrow(conds))
chains = 4
for (i in 1:nrow(conds)) {
iters <- conds[i, "iters"]
trend <- conds[i, "trend"]
rep <- conds[i, "rep"]
r <- array(rnorm(iters * chains), c(iters, chains))
r <- r + seq(-trend, trend, length.out = iters)
rs <- as.data.frame(monitor_extra(r))
res[[i]] <- cbind(iters, trend, rep, rs)
}
res <- bind_rows(res)
If we don’t split chains, Rhat misses the trends if all chains still have a similar marginal distribution.
ggplot(data = res, aes(y = Rhat, x = trend)) +
geom_point() +
geom_jitter() +
facet_grid(. ~ iters) +
geom_hline(yintercept = 1.005, linetype = 'dashed') +
geom_hline(yintercept = 1) +
ggtitle('Rhat without splitting chains')
Split-Rhat can detect trends, even if the marginals of the chains are similar.
ggplot(data = res, aes(y = zsRhat, x = trend)) +
geom_point() + geom_jitter() +
facet_grid(. ~ iters) +
geom_hline(yintercept = 1.005, linetype = 'dashed') +
geom_hline(yintercept = 1) +
ggtitle('Split-Rhat')
Result: Split-Rhat is useful for detecting non-stationarity (i.e., trends) in the chains. If we use a threshold of \(1.01\), we can detect trends which account for 2% or more of the total marginal variance. If we use a threshold of \(1.1\), we detect trends which account for 30% or more of the total marginal variance.
The effective sample size is based on split Rhat and within-chain autocorrelation. We plot relative efficiency \(R_{\rm eff}=S_{\rm eff}/S\) for easier comparison between different values of \(S\).
ggplot(data = res, aes(y = zsreff, x = trend)) +
geom_point() +
geom_jitter() +
facet_grid(. ~ iters) +
geom_hline(yintercept = c(0,1)) +
geom_hline(yintercept = 0.1, linetype = 'dashed') +
ggtitle('Relative Bulk-ESS (zsreff)') +
scale_y_continuous(breaks = seq(0, 1, by = 0.25))
Result: Split-Rhat is more sensitive to trends for small sample sizes, but effective sample size becomes more sensitive for larger samples sizes (as autocorrelations can be estimated more accurately).
Advice: If in doubt, run longer chains for more accurate convergence diagnostics.
Next we investigate the sensitivity to detect if one of the chains has not converged to the same distribution as the others, but has a different mean.
conds <- expand.grid(
iters = c(250, 1000, 4000),
shift = c(0, 0.25, 0.5, 0.75, 1),
rep = 1:10
)
res <- vector("list", nrow(conds))
chains = 4
for (i in 1:nrow(conds)) {
iters <- conds[i, "iters"]
shift <- conds[i, "shift"]
rep <- conds[i, "rep"]
r <- array(rnorm(iters * chains), c(iters, chains))
r[, 1] <- r[, 1] + shift
rs <- as.data.frame(monitor_extra(r))
res[[i]] <- cbind(iters, shift, rep, rs)
}
res <- bind_rows(res)
ggplot(data = res, aes(y = zsRhat, x = shift)) +
geom_point() +
geom_jitter() +
facet_grid(. ~ iters) +
geom_hline(yintercept = 1.005, linetype = 'dashed') +
geom_hline(yintercept = 1) +
ggtitle('Split-Rhat')
Result: If we use a threshold of \(1.01\), we can detect shifts with a magnitude of one third or more of the marginal standard deviation. If we use a threshold of \(1.1\), we detect shifts with a magnitude equal to or larger than the marginal standard deviation.
ggplot(data = res, aes(y = zsreff, x = shift)) +
geom_point() +
geom_jitter() +
facet_grid(. ~ iters) +
geom_hline(yintercept = c(0,1)) +
geom_hline(yintercept = 0.1, linetype = 'dashed') +
ggtitle('Relative Bulk-ESS (zsreff)') +
scale_y_continuous(breaks = seq(0, 1, by = 0.25))
Result: The effective sample size is not as sensitive, but a shift with a magnitude of half the marginal standard deviation or more will lead to very low relative efficiency when the total number of draws increases.
Rank plots can be used to visualize differences between chains. Here, we show rank plots for the case of 4 chains, 250 draws per chain, and a shift of 0.5.
iters = 250
chains = 4
shift = 0.5
r <- array(rnorm(iters * chains), c(iters, chains))
r[, 1] <- r[, 1] + shift
colnames(r) <- 1:4
mcmc_hist_r_scale(r)
Although, Rhat was less than \(1.05\) for this situation, the rank plots clearly show that the first chains behaves differently.
Next, we investigate the sensitivity to detect if one of the chains has not converged to the same distribution as the others, but has lower marginal variance.
conds <- expand.grid(
iters = c(250, 1000, 4000),
scale = c(0, 0.25, 0.5, 0.75, 1),
rep = 1:10
)
res <- vector("list", nrow(conds))
chains = 4
for (i in 1:nrow(conds)) {
iters <- conds[i, "iters"]
scale <- conds[i, "scale"]
rep <- conds[i, "rep"]
r <- array(rnorm(iters * chains), c(iters, chains))
r[, 1] <- r[, 1] * scale
rs <- as.data.frame(monitor_extra(r))
res[[i]] <- cbind(iters, scale, rep, rs)
}
res <- bind_rows(res)
We first look at the Rhat estimates:
ggplot(data = res, aes(y = zsRhat, x = scale)) +
geom_point() +
geom_jitter() +
facet_grid(. ~ iters) +
geom_hline(yintercept = 1.005, linetype = 'dashed') +
geom_hline(yintercept = 1) +
ggtitle('Split-Rhat')
Result: Split-Rhat is not able to detect scale differences between chains.
ggplot(data = res, aes(y = zfsRhat, x = scale)) +
geom_point() +
geom_jitter() +
facet_grid(. ~ iters) +
geom_hline(yintercept = 1.005, linetype = 'dashed') +
geom_hline(yintercept = 1) +
ggtitle('Folded-split-Rhat')
Result: Folded-Split-Rhat focuses on scales and detects scale differences.
Result: If we use a threshold of \(1.01\), we can detect a chain with scale less than \(3/4\) of the standard deviation of the others. If we use threshold of \(1.1\), we detect a chain with standard deviation less than \(1/4\) of the standard deviation of the others.
Next, we look at the effective sample size estimates:
ggplot(data = res, aes(y = zsreff, x = scale)) +
geom_point() +
geom_jitter() +
facet_grid(. ~ iters) +
geom_hline(yintercept = c(0,1)) +
geom_hline(yintercept = 0.1, linetype = 'dashed') +
ggtitle('Relative Bulk-ESS (zsreff)') +
scale_y_continuous(breaks = seq(0, 1, by = 0.25))
Result: The bulk effective sample size of the mean does not see a problem as it focuses on location differences between chains.
ggplot(data = res, aes(y = zfsreff, x = scale)) +
geom_point() +
geom_jitter() +
facet_grid(. ~ iters) +
geom_hline(yintercept = 0) +
geom_hline(yintercept = 0.1, linetype = 'dashed') +
ggtitle('Relative tail-ESS (zfsreff)') +
scale_y_continuous(breaks = seq(0, 1, by = 0.25))
Result: The tail effective sample size detects the difference in scales as it focuses on the scale of the chains.
Rank plots can be used to visualize differences between chains. Here, we show rank plots for the case of 4 chains, 250 draws per chain, and with one chain having a standard deviation of 0.75 as opposed to a standard deviation of 1 for the other chains.
iters = 250
chains = 4
scale = 0.75
r <- array(rnorm(iters * chains), c(iters, chains))
r[, 1] <- r[, 1] * scale
colnames(r) <- 1:4
mcmc_hist_r_scale(r)
Although folded Rhat is \(1.06\), the rank plots clearly show that the first chains behaves differently.
The classic split-Rhat is based on calculating within and between chain variances. If the marginal distribution of a chain is such that the variance is not defined (i.e. infinite), the classic split-Rhat is not well justified. In this section, we will use the Cauchy distribution as an example of such distribution. Also in cases where mean and variance are finite, the distribution can be far from Gaussian. Especially distributions with very long tails cause instability for variance and autocorrelation estimates. To alleviate these problems we will use Split-Rhat for rank-normalized draws.
The following Cauchy models are from Michael Betancourt’s case study Fitting The Cauchy Distribution
We already looked at the nominal Cauchy model with direct parameterization in the main text, but for completeness, we take a closer look using different variants of the diagnostics.
writeLines(readLines("cauchy_nom.stan"))
parameters {
vector[50] x;
}
model {
x ~ cauchy(0, 1);
}
generated quantities {
real I = fabs(x[1]) < 1 ? 1 : 0;
}
Run the nominal model:
fit_nom <- stan(file = 'cauchy_nom.stan', seed = 7878, refresh = 0)
Warning: There were 1421 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
Warning: Examine the pairs() plot to diagnose sampling problems
Treedepth exceedence and Bayesian Fraction of Missing Information are dynamic HMC specific diagnostics (Betancourt, 2017). We get warnings about very large number of transitions after warmup that exceeded the maximum treedepth, which is likely due to very long tails of the Cauchy distribution. All chains have low estimated Bayesian fraction of missing information also indicating slow mixing.
Trace plots for the first parameter look wild with occasional large values:
samp <- as.array(fit_nom)
mcmc_trace(samp[, , 1])
Let’s check Rhat and ESS diagnostics.
res <- monitor_extra(samp[, , 1:50])
which_min_ess <- which.min(res$zsseff)
plot_rhat(res)
For one parameter, Rhats exceed the classic threshold of 1.1. Depending on the Rhat estimate, a few others also exceed the threshold of 1.01. The rank normalized split-Rhat has several values over 1.01. Please note that the classic split-Rhat is not well defined in this example, because mean and variance of the Cauchy distribution are not finite.
plot_ess(res)
Both classic and new effective sample size estimates have several very small values, and so the overall sample shouldn’t be trusted.
Result: Effective sample size is more sensitive than (rank-normalized) split-Rhat to detect problems of slow mixing.
We also check the log posterior value lp__ and find out that the effective sample size is worryingly low.
res <- monitor_extra(samp[, , 51:52])
cat('lp__: Bulk-ESS = ', round(res['lp__', 'zsseff'], 2), '\n')
lp__: Bulk-ESS = 249
cat('lp__: Tail-ESS = ', round(res['lp__', 'zfsseff'], 2), '\n')
lp__: Tail-ESS = 655
We can further analyze potential problems using local effective sample size and rank plots. We examine x[5], which has the smallest bulk-ESS 249.
We examine the sampling efficiency in different parts of the posterior by computing the effective sample size for small interval probability estimates (see Section Efficiency for small interval probability estimates). Each interval contains \(1/k\) of the draws (e.g., with \(k=20\)). The small interval efficiency measures mixing of an indicator function which indicates when the values are inside the specific small interval. This gives us a local efficiency measure which does not depend on the shape of the distribution.
plot_local_ess(fit = fit_nom, par = which_min_ess, nalpha = 20)
We see that the efficiency is worryingly low in the tails (which is caused by slow mixing in long tails of Cauchy). Orange ticks show draws that exceeded the maximum treedepth.
An alternative way to examine the effective sample size in different parts of the posterior is to compute effective sample size for quantiles (see Section Efficiency for quantiles). Each interval has a specified proportion of draws, and the efficiency measures mixing of an indicator function’s results which indicate when the values are inside the specific interval.
plot_quantile_ess(fit = fit_nom, par = which_min_ess, nalpha = 40)
We see that the efficiency is worryingly low in the tails (which is caused by slow mixing in long tails of Cauchy). Orange ticks show draws that exceeded the maximum treedepth.
We can further analyze potential problems using rank plots, from which we clearly see differences between chains.
xmin <- paste0("x[", which_min_ess, "]")
mcmc_hist_r_scale(samp[, , xmin])
We can try to improve the performance by increasing max_treedepth to \(20\):
fit_nom_td20 <- stan(
file = 'cauchy_nom.stan', seed = 7878,
refresh = 0, control = list(max_treedepth = 20)
)
Warning: There were 1 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 20. See
http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
Warning: Examine the pairs() plot to diagnose sampling problems
Trace plots for the first parameter still look wild with occasional large values.
samp <- as.array(fit_nom_td20)
mcmc_trace(samp[, , 1])
res <- monitor_extra(samp[, , 1:50])
which_min_ess <- which.min(res$zsseff)
We check the diagnostics for all \(x\).
plot_rhat(res)
All Rhats are below \(1.1\), but many are over \(1.01\). Classic split-Rhat has more variation than the rank normalized Rhat (note that the former is not well defined). The folded rank normalized Rhat shows that there is still more variation in the scale than in the location between different chains.
plot_ess(res)
Some classic effective sample sizes are very small. If we wouldn’t realize that the variance is infinite, we might try to run longer chains, but in case of an infinite variance, zero relative efficiency (ESS/S) is the truth and longer chains won’t help with that. However other quantities can be well defined, and that’s why it is useful to also look at the rank normalized version as a generic transformation to achieve finite mean and variance. The smallest bulk-ESS is less than 1000, which is not that bad. The smallest median-ESS is larger than 2500, that is we are able to estimate the median efficiently. However, many tail-ESS’s are less than 400 indicating problems for estimating the scale of the posterior.
Result: The rank normalized effective sample size is more stable than classic effective sample size, which is not well defined for the Cauchy distribution.
Result: It is useful to look at both bulk- and tail-ESS.
We check also lp__. Although increasing max_treedepth improved bulk-ESS of x, the efficiency for lp__ didn’t change.
res <- monitor_extra(samp[, , 51:52])
cat('lp__: Bulk-ESS =', round(res['lp__', 'zsseff'], 2), '\n')
lp__: Bulk-ESS = 155
cat('lp__: Tail-ESS =', round(res['lp__', 'zfsseff'], 2), '\n')
lp__: Tail-ESS = 558
We examine the sampling efficiency in different parts of the posterior by computing the effective sample size for small interval probability estimates.
plot_local_ess(fit = fit_nom_td20, par = which_min_ess, nalpha = 20)
It seems that increasing max_treedepth has not much improved the efficiency in the tails. We also examine the effective sample size of different quantile estimates.
plot_quantile_ess(fit = fit_nom_td20, par = which_min_ess, nalpha = 40)
The rank plot visualisation of x[50], which has the smallest effective sample size of 155 among the \(x\), indicates clear convergence problems.
xmin <- paste0("x[", which_min_ess, "]")
mcmc_hist_r_scale(samp[, , xmin])
The rank plot visualisation of lp__, which has an effective sample size 155, doesn’t look so good either.
mcmc_hist_r_scale(samp[, , "lp__"])
Let’s try running 8 times longer chains.
fit_nom_td20l <- stan(
file = 'cauchy_nom.stan', seed = 7878,
refresh = 0, control = list(max_treedepth = 20),
warmup = 1000, iter = 9000
)
Warning: There were 2 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 20. See
http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
Warning: Examine the pairs() plot to diagnose sampling problems
Trace plots for the first parameter still look wild with occasional large values.
samp <- as.array(fit_nom_td20l)
mcmc_trace(samp[, , 1])
res <- monitor_extra(samp[, , 1:50])
which_min_ess <- which.min(res$zsseff)
Let’s check the diagnostics for all \(x\).
plot_rhat(res)
All Rhats are below \(1.01\). The classic split-Rhat has more variation than the rank normalized Rhat (note that the former is not well defined in this case).
plot_ess(res)
Most classic ESS’s are close to zero. Running longer chains just made most classic ESS’s even smaller.
The smallest bulk-ESS are around 5000, which is not that bad. The smallest median-ESS’s are larger than 25000, that is we are able to estimate the median efficiently. However, the smallest tail-ESS is 3165 indicating problems for estimating the scale of the posterior.
Result: The rank normalized effective sample size is more stable than classic effective sample size even for very long chains.
Result: It is useful to look at both bulk- and tail-ESS.
We also check lp__. Although increasing the number of iterations improved bulk-ESS of the \(x\), the relative efficiency for lp__ didn’t change.
res <- monitor_extra(samp[, , 51:52])
cat('lp__: Bulk-ESS =', round(res['lp__', 'zsseff'], 2), '\n')
lp__: Bulk-ESS = 1285
cat('lp__: Tail-ESS =', round(res['lp__', 'zfsseff'], 2), '\n')
lp__: Tail-ESS = 4037
We examine the sampling efficiency in different parts of the posterior by computing the effective sample size for small interval probability estimates.
plot_local_ess(fit = fit_nom_td20l, par = which_min_ess, nalpha = 20)
Increasing the chain length did not seem to change the relative efficiency. With more draws from the longer chains we can use a finer resolution for the local efficiency estimates.
plot_local_ess(fit = fit_nom_td20l, par = which_min_ess, nalpha = 100)
The sampling efficiency far in the tails is worryingly low. This was more difficult to see previously with less draws from the tails. We see similar problems in the plot of effective sample size for quantiles.
plot_quantile_ess(fit = fit_nom_td20l, par = which_min_ess, nalpha = 100)
Let’s look at the rank plot visualisation of x[48], which has the smallest effective sample size 1285 among the \(x\).
xmin <- paste0("x[", which_min_ess, "]")
mcmc_hist_r_scale(samp[, , xmin])
Increasing the number of iterations couldn’t remove the mixing problems at the tails. The mixing problem is inherent to the nominal parameterization of Cauchy distribution.
Next, we examine an alternative parameterization and consider the Cauchy distribution as a scale mixture of Gaussian distributions. The model has two parameters and the Cauchy distributed \(x\) can be computed from those. In addition to improved sampling performance, the example illustrates that focusing on diagnostics matters.
writeLines(readLines("cauchy_alt_1.stan"))
parameters {
vector[50] x_a;
vector<lower=0>[50] x_b;
}
transformed parameters {
vector[50] x = x_a ./ sqrt(x_b);
}
model {
x_a ~ normal(0, 1);
x_b ~ gamma(0.5, 0.5);
}
generated quantities {
real I = fabs(x[1]) < 1 ? 1 : 0;
}
We run the alternative model:
fit_alt1 <- stan(file='cauchy_alt_1.stan', seed=7878, refresh = 0)
There are no warnings and the sampling is much faster.
samp <- as.array(fit_alt1)
res <- monitor_extra(samp[, , 101:150])
which_min_ess <- which(res$zsseff == min(res$zsseff))
plot_rhat(res)
All Rhats are below \(1.01\). Classic split-Rhats also look good even though they are not well defined for the Cauchy distribution.
plot_ess(res)
Result: Rank normalized ESS’s have less variation than classic one which is not well defined for Cauchy.
We check lp__:
res <- monitor_extra(samp[, , 151:152])
cat('lp__: Bulk-ESS =', round(res['lp__', 'zsseff'], 2), '\n')
lp__: Bulk-ESS = 1517
cat('lp__: Tail-ESS =', round(res['lp__', 'zfsseff'], 2), '\n')
lp__: Tail-ESS = 2931
The relative efficiencies for lp__ are also much better than with the nominal parameterization.
We examine the sampling efficiency in different parts of the posterior by computing the effective sample size for small interval probability estimates.
plot_local_ess(fit = fit_alt1, par = 100 + which_min_ess, nalpha = 20)
The effective sample size is good in all parts of the posterior. We also examine the effective sample size of different quantile estimates.
plot_quantile_ess(fit = fit_alt1, par = 100 + which_min_ess, nalpha = 40)
We compare the mean relative efficiencies of the underlying parameters in the new parameterization and the actual \(x\) we are interested in.
res <- monitor_extra(samp[, , 101:150])
res1 <- monitor_extra(samp[, , 1:50])
res2 <- monitor_extra(samp[, , 51:100])
cat('Mean Bulk-ESS for x =' , round(mean(res[, 'zsseff']), 2), '\n')
Mean Bulk-ESS for x = 3782.24
cat('Mean Tail-ESS for x =' , round(mean(res[, 'zfsseff']), 2), '\n')
Mean Tail-ESS for x = 2119.68
cat('Mean Bulk-ESS for x_a =' , round(mean(res1[, 'zsseff']), 2), '\n')
Mean Bulk-ESS for x_a = 4043.48
cat('Mean Bulk-ESS for x_b =' , round(mean(res2[, 'zsseff']), 2), '\n')
Mean Bulk-ESS for x_b = 2638.64
Result: We see that the effective sample size of the interesting \(x\) can be different from the effective sample size of the parameters \(x_a\) and \(x_b\) that we used to compute it.
The rank plot visualisation of x[24], which has the smallest effective sample size of 3133 among the \(x\) looks better than for the nominal parameterization.
xmin <- paste0("x[", which_min_ess, "]")
mcmc_hist_r_scale(samp[, , xmin])
Similarly, the rank plot visualisation of lp__, which has a relative efficiency of -81.62, 0.2, 7.89, -95.29, -81.24, -69.52, 1496, 0.37, 1507, 1506, 1517, 0.38, 1, 1, 1, 1, 1, 2931, 0.73, 1945, 0.49, 2973, 0.74 looks better than for the nominal parameterization.
mcmc_hist_r_scale(samp[, , "lp__"])
Another alternative parameterization is obtained by a univariate transformation as shown in the following code (see also the 3rd alternative in Michael Betancourt’s case study).
writeLines(readLines("cauchy_alt_3.stan"))
parameters {
vector<lower=0, upper=1>[50] x_tilde;
}
transformed parameters {
vector[50] x = tan(pi() * (x_tilde - 0.5));
}
model {
// Implicit uniform prior on x_tilde
}
generated quantities {
real I = fabs(x[1]) < 1 ? 1 : 0;
}
We run the alternative model:
fit_alt3 <- stan(file='cauchy_alt_3.stan', seed=7878, refresh = 0)
There are no warnings, and the sampling is much faster than for the nominal model.
samp <- as.array(fit_alt3)
res <- monitor_extra(samp[, , 51:100])
which_min_ess <- which(res$zsseff == min(res$zsseff))
plot_rhat(res)
All Rhats except some folded Rhats are below 1.01. Classic split-Rhat’s look also good even though it is not well defined for the Cauchy distribution.
plot_ess(res)
Result: Rank normalized relative efficiencies have less variation than classic ones. Bulk-ESS and median-ESS are slightly larger than 1, which is possible for antithetic Markov chains which have negative correlation for odd lags.
We also take a closer look at the lp__ value:
res <- monitor_extra(samp[, , 101:102])
cat('lp__: Bulk-ESS =', round(res['lp__', 'zsseff'], 2), '\n')
lp__: Bulk-ESS = 1334
cat('lp__: Tail-ESS =', round(res['lp__', 'zfsseff'], 2), '\n')
lp__: Tail-ESS = 2269
The effective sample size for these are also much better than with the nominal parameterization.
We examine the sampling efficiency in different parts of the posterior by computing the effective sample size for small interval probability estimates.
plot_local_ess(fit = fit_alt3, par = 50 + which_min_ess, nalpha = 20)
We examine also the sampling efficiency of different quantile estimates.
plot_quantile_ess(fit = fit_alt3, par = 50 + which_min_ess, nalpha = 40)
The effective sample size in tails is worse than for the first alternative parameterization, although it’s still better than for the nominal parameterization.
We compare the mean effective sample size of the underlying parameter in the new parameterization and the actually Cauchy distributed \(x\) we are interested in.
res <- monitor_extra(samp[, , 51:100])
res1 <- monitor_extra(samp[, , 1:50])
cat('Mean bulk-seff for x =' , round(mean(res[, 'zsseff']), 2), '\n')
Mean bulk-seff for x = 4888.2
cat('Mean tail-seff for x =' , round(mean(res[, 'zfsseff']), 2), '\n')
Mean tail-seff for x = 1602.1
cat('Mean bulk-seff for x_tilde =' , round(mean(res1[, 'zsseff']), 2), '\n')
Mean bulk-seff for x_tilde = 4888.2
cat('Mean tail-seff for x_tilde =' , round(mean(res1[, 'zfsseff']), 2), '\n')
Mean tail-seff for x_tilde = 1609.82
The Rank plot visualisation of x[18], which has the smallest effective sample size of 4225 among the \(x\) reveals shows good efficiency, similar to the results for lp__.
xmin <- paste0("x[", which_min_ess, "]")
mcmc_hist_r_scale(samp[, , xmin])
mcmc_hist_r_scale(samp[, , "lp__"])
Half-Cauchy priors are common and, for example, in Stan usually set using the nominal parameterization. However, when the constraint <lower=0> is used, Stan does the sampling automatically in the unconstrained log(x) space, which changes the geometry crucially.
writeLines(readLines("half_cauchy_nom.stan"))
parameters {
vector<lower=0>[50] x;
}
model {
x ~ cauchy(0, 1);
}
generated quantities {
real I = fabs(x[1]) < 1 ? 1 : 0;
}
We run the half-Cauchy model with nominal parameterization (and positive constraint).
fit_half_nom <- stan(file = 'half_cauchy_nom.stan', seed = 7878, refresh = 0)
There are no warnings and the sampling is much faster than for the full Cauchy distribution with nominal parameterization.
samp <- as.array(fit_half_nom)
res <- monitor_extra(samp[, , 1:50])
which_min_ess <- which.min(res$zsseff)
plot_rhat(res)
All Rhats are below \(1.01\). Classic split-Rhats also look good even though they are not well defined for the half-Cauchy distribution.
plot_ess(res)
Result: Rank normalized effective sample size have less variation than classic ones. Some Bulk-ESS and median-ESS are larger than 1, which is possible for antithetic Markov chains which have negative correlation for odd lags.
Due to the <lower=0> constraint, the sampling was made in the log(x) space, and we can also check the performance in that space.
res <- monitor_extra(log(samp[, , 1:50]))
plot_ess(res)
\(\log(x)\) is quite close to Gaussian, and thus classic effective sample size is also close to rank normalized ESS which is exactly the same as for the original \(x\) as rank normalization is invariant to bijective transformations.
Result: The rank normalized effective sample size is close to the classic effective sample size for transformations which make the distribution close to Gaussian.
We examine the sampling efficiency in different parts of the posterior by computing the effective sample size for small interval probability estimates.
plot_local_ess(fit = fit_half_nom, par = which_min_ess, nalpha = 20)
The effective sample size is good overall, with only a small dip in tails. We can also examine the effective sample size of different quantile estimates.
plot_quantile_ess(fit = fit_half_nom, par = which_min_ess, nalpha = 40)
The rank plot visualisation of x[2], which has the smallest effective sample size of 6019 among \(x\), looks good.
xmin <- paste0("x[", which_min_ess, "]")
mcmc_hist_r_scale(samp[, , xmin])
The rank plot visualisation of lp__ reveals some small differences in the scales, but it’s difficult to know whether this small variation from uniform is relevant.
mcmc_hist_r_scale(samp[, , "lp__"])
writeLines(readLines("half_cauchy_alt.stan"))
parameters {
vector<lower=0>[50] x_a;
vector<lower=0>[50] x_b;
}
transformed parameters {
vector[50] x = x_a .* sqrt(x_b);
}
model {
x_a ~ normal(0, 1);
x_b ~ inv_gamma(0.5, 0.5);
}
generated quantities {
real I = fabs(x[1]) < 1 ? 1 : 0;
}
Run half-Cauchy with alternative parameterization
fit_half_reparam <- stan(
file = 'half_cauchy_alt.stan', seed = 7878, refresh = 0
)
There are no warnings and the sampling is as fast for the half-Cauchy nominal model.
samp <- as.array(fit_half_reparam)
res <- monitor_extra(samp[, , 101:150])
which_min_ess <- which.min(res$zsseff)
plot_rhat(res)
plot_ess(res)
Result: The Rank normalized relative efficiencies have less variation than classic ones which is not well defined for the Cauchy distribution. Based on bulk-ESS and median-ESS, the efficiency for central quantities is much lower, but based on tail-ESS and MAD-ESS, the efficiency in the tails is slightly better than for the half-Cauchy distribution with nominal parameterization. We also see that a parameterization which is good for the full Cauchy distribution is not necessarily good for the half-Cauchy distribution as the <lower=0> constraint additionally changes the parameterization.
We also check the lp__ values:
res <- monitor_extra(samp[, , 151:152])
cat('lp__: Bulk-ESS =', round(res['lp__', 'zsseff'], 2), '\n')
lp__: Bulk-ESS = 734
cat('lp__: Tail-ESS =', round(res['lp__', 'zfsseff'], 2), '\n')
lp__: Tail-ESS = 1992
We examine the sampling efficiency in different parts of the posterior by computing the effective sample size for small interval probability estimates.
plot_local_ess(fit_half_reparam, par = 100 + which_min_ess, nalpha = 20)
We also examine the effective sample size for different quantile estimates.
plot_quantile_ess(fit_half_reparam, par = 100 + which_min_ess, nalpha = 40)
The effective sample size near zero is much worse than for the half-Cauchy distribution with nominal parameterization.
The Rank plot visualisation of x[17], which has the smallest relative efficiency of 734 among the \(x\), reveals deviations from uniformity, which is expected with lower effective sample size.
xmin <- paste0("x[", which_min_ess, "]")
mcmc_hist_r_scale(samp[, , xmin])
A similar result is obtained when looking at the rank plots of lp__.
mcmc_hist_r_scale(samp[, , "lp__"])
We continue with our discussion about hierarchical models on the Eight Schools data, which we started in Section Eight Schools. We also analyse the performance of different variants of the diagnostics.
writeLines(readLines("eight_schools_cp.stan"))
data {
int<lower=0> J;
real y[J];
real<lower=0> sigma[J];
}
parameters {
real mu;
real<lower=0> tau;
real theta[J];
}
model {
mu ~ normal(0, 5);
tau ~ cauchy(0, 5);
theta ~ normal(mu, tau);
y ~ normal(theta, sigma);
}
In the main text, we observed that the centered parameterization of this hierarchical model did not work well with the default MCMC options of Stan plus increased adapt_delta, and so we directly try to fit the model with longer chains.
Low efficiency can be sometimes compensated with longer chains. Let’s check 10 times longer chain.
fit_cp2 <- stan(
file = 'eight_schools_cp.stan', data = eight_schools,
iter = 20000, chains = 4, seed = 483892929, refresh = 0,
control = list(adapt_delta = 0.95)
)
Warning: There were 736 divergent transitions after warmup. Increasing adapt_delta above 0.95 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
Warning: There were 2 chains where the estimated Bayesian Fraction of Missing Information was low. See
http://mc-stan.org/misc/warnings.html#bfmi-low
Warning: Examine the pairs() plot to diagnose sampling problems
res <- monitor_extra(fit_cp2)
print(res)
Inference for the input samples (4 chains: each with iter = 20000; warmup = 10000):
mean se_mean sd Q5 Q50 Q95 seff reff sseff zseff zsseff zsreff Rhat sRhat
mu 4.39 0.05 3.29 -0.99 4.43 9.73 4752 0.12 4798 4747 4792 0.12 1 1.00
tau 3.81 0.07 3.22 0.49 2.95 10.04 2034 0.05 2044 809 826 0.02 1 1.00
theta[1] 6.31 0.07 5.68 -1.43 5.69 16.40 7272 0.18 7281 6701 6726 0.17 1 1.00
theta[2] 4.93 0.05 4.72 -2.43 4.85 12.71 9695 0.24 9780 8530 8847 0.22 1 1.00
theta[3] 3.90 0.05 5.37 -5.02 4.15 11.97 9998 0.25 10062 8103 8240 0.21 1 1.00
theta[4] 4.79 0.05 4.82 -2.72 4.75 12.65 9066 0.23 9183 7983 7954 0.20 1 1.00
theta[5] 3.56 0.05 4.68 -4.45 3.83 10.65 8240 0.21 8209 7111 7130 0.18 1 1.00
theta[6] 4.04 0.05 4.92 -4.17 4.20 11.73 9595 0.24 9566 8410 8505 0.21 1 1.00
theta[7] 6.44 0.07 5.15 -0.81 5.92 15.58 6260 0.16 6337 5796 5886 0.15 1 1.00
theta[8] 4.86 0.05 5.43 -3.40 4.80 13.57 10213 0.26 10300 8295 8342 0.21 1 1.00
lp__ -14.64 0.25 6.82 -25.06 -15.17 -2.25 764 0.02 781 823 844 0.02 1 1.01
zRhat zsRhat zfsRhat zfsseff zfsreff medsseff medsreff madsseff madsreff
mu 1 1.00 1 8921 0.22 4529 0.11 7922 0.20
tau 1 1.01 1 7088 0.18 2254 0.06 6200 0.16
theta[1] 1 1.00 1 4139 0.10 5826 0.15 7803 0.20
theta[2] 1 1.00 1 7238 0.18 5989 0.15 7661 0.19
theta[3] 1 1.00 1 8034 0.20 4821 0.12 7516 0.19
theta[4] 1 1.00 1 7705 0.19 4882 0.12 7965 0.20
theta[5] 1 1.00 1 9193 0.23 4277 0.11 8038 0.20
theta[6] 1 1.00 1 9230 0.23 5113 0.13 7868 0.20
theta[7] 1 1.00 1 6516 0.16 5453 0.14 7843 0.20
theta[8] 1 1.00 1 6585 0.16 5662 0.14 7462 0.19
lp__ 1 1.01 1 1477 0.04 2082 0.05 4917 0.12
We still get a whole bunch of divergent transitions so it’s clear that the results can’t be trusted even if all other diagnostics were good. Still, it may be worth looking at additional diagnostics to better understand what’s happening.
Some rank-normalized split-Rhats are still larger than \(1.01\). Bulk-ESS for tau and lp__ are around 800 which corresponds to low relative efficiency of \(1\%\), but is above our recommendation of ESS>400. In this kind of cases, it is useful to look at the local efficiency estimates, too (and the larger number of divergences is clear indication of problems, too).
We examine the sampling efficiency in different parts of the posterior by computing the effective sample size for small intervals for tau.
plot_local_ess(fit = fit_cp2, par = "tau", nalpha = 50)
We see that the sampling has difficulties in exploring small tau values. As ESS<400 for small probability intervals in case of small tau values, we may suspect that we may miss substantial amount of posterior mass and get biased estimates.
We also examine the effective sample size of different quantile estimates.
plot_quantile_ess(fit = fit_cp2, par = "tau", nalpha = 100)
Several quantile estimates have ESS<400, which raises a doubt that there are convergence problems and we may have significant bias.
Let’s see how the Bulk-ESS and Tail-ESS changes when we use more and more draws.
plot_change_ess(fit = fit_cp2, par = "tau")
We see that given recommendation that Bulk-ESS>400 and Tail-ESS>400, they are not sufficient to detect convergence problems in this case, even the tail quantile estimates are able to detect these problems.
The rank plot visualisation of tau also shows clear sticking and mixing problems.
samp_cp2 <- as.array(fit_cp2)
mcmc_hist_r_scale(samp_cp2[, , "tau"])
Similar results are obtained for lp__, which is closely connected to tau for this model.
mcmc_hist_r_scale(samp_cp2[, , "lp__"])
We may also examine small interval efficiencies for mu.
plot_local_ess(fit = fit_cp2, par = "mu", nalpha = 50)
There are gaps of poor efficiency which again indicates problems in the mixing of the chains. However, these problems do not occur for any specific range of values of mu as was the case for tau. This tells us that it’s probably not mu with which the sampler has problems, but more likely tau or a related quantity.
As we observed divergences, we shouldn’t trust any Monte Carlo standard error (MCSE) estimates as they are likely biased, as well. However, for illustration purposes, we compute the MCSE, tail quantiles and corresponding effective sample sizes for the median of mu and tau. Comparing to the shorter MCMC run, using 10 times more draws has not reduced the MCSE to one third as would be expected without problems in the mixing of the chains.
round(quantile_mcse(samp_cp2[ , , "mu"], prob = 0.5), 2)
mcse Q05 Q95 Seff
1 0.06 4.32 4.53 4528.59
round(quantile_mcse(samp_cp2[ , , "tau"], prob = 0.5), 2)
mcse Q05 Q95 Seff
1 0.07 2.83 3.07 2254.07
For further evidence, let’s check 100 times longer chains than the default. This is not something we would recommend doing in practice, as it is not able to solve any problems with divergences as illustrated below.
fit_cp3 <- stan(
file = 'eight_schools_cp.stan', data = eight_schools,
iter = 200000, chains = 4, seed = 483892929, refresh = 0,
control = list(adapt_delta = 0.95)
)
Warning: There were 9955 divergent transitions after warmup. Increasing adapt_delta above 0.95 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
Warning: There were 4 chains where the estimated Bayesian Fraction of Missing Information was low. See
http://mc-stan.org/misc/warnings.html#bfmi-low
Warning: Examine the pairs() plot to diagnose sampling problems
res <- monitor_extra(fit_cp3)
print(res)
Inference for the input samples (4 chains: each with iter = 2e+05; warmup = 1e+05):
mean se_mean sd Q5 Q50 Q95 seff reff sseff zseff zsseff zsreff Rhat sRhat
mu 4.45 0.06 3.44 -1.11 4.41 10.11 3376 0.01 3341 3934 3893 0.01 1 1
tau 3.77 0.03 3.21 0.44 2.91 10.03 11675 0.03 11652 1934 1863 0.00 1 1
theta[1] 6.36 0.05 5.70 -1.57 5.78 16.33 13093 0.03 12903 10874 10743 0.03 1 1
theta[2] 5.02 0.05 4.81 -2.54 4.90 13.21 8225 0.02 8117 8347 8233 0.02 1 1
theta[3] 3.97 0.06 5.42 -5.00 4.16 12.35 8445 0.02 8318 7627 7541 0.02 1 1
theta[4] 4.81 0.05 4.92 -2.99 4.73 13.08 9207 0.02 9141 9000 8938 0.02 1 1
theta[5] 3.63 0.06 4.81 -4.55 3.83 11.14 6505 0.02 6443 6086 6024 0.02 1 1
theta[6] 4.07 0.06 4.98 -4.19 4.19 12.00 7329 0.02 7238 7018 6927 0.02 1 1
theta[7] 6.45 0.05 5.20 -1.04 5.96 15.62 9861 0.02 9769 9479 9393 0.02 1 1
theta[8] 4.92 0.05 5.44 -3.47 4.80 13.58 12287 0.03 12205 10838 10756 0.03 1 1
lp__ -14.54 0.10 6.82 -24.94 -15.07 -2.28 4936 0.01 4929 5153 5146 0.01 1 1
zRhat zsRhat zfsRhat zfsseff zfsreff medsseff medsreff madsseff madsreff
mu 1 1 1 3154 0.01 18360 0.05 14584 0.04
tau 1 1 1 36641 0.09 14003 0.04 15028 0.04
theta[1] 1 1 1 17072 0.04 17730 0.04 15418 0.04
theta[2] 1 1 1 10355 0.03 17620 0.04 14264 0.04
theta[3] 1 1 1 11243 0.03 18993 0.05 13741 0.03
theta[4] 1 1 1 11852 0.03 18121 0.05 14711 0.04
theta[5] 1 1 1 8373 0.02 18764 0.05 13977 0.03
theta[6] 1 1 1 9961 0.02 17788 0.04 13662 0.03
theta[7] 1 1 1 12276 0.03 17502 0.04 16218 0.04
theta[8] 1 1 1 16739 0.04 18122 0.05 15442 0.04
lp__ 1 1 1 6866 0.02 13296 0.03 16026 0.04
Rhat, Bulk-ESS and Tail-ESS are not able to detect problems, although relative Bulk-ESS for tau is very small 0.5%.
plot_local_ess(fit = fit_cp3, par = "tau", nalpha = 100)
plot_quantile_ess(fit = fit_cp3, par = "tau", nalpha = 100)
And the rank plots of tau also show sticking and mixing problems for small values of tau.
samp_cp3 <- as.array(fit_cp3)
mcmc_hist_r_scale(samp_cp3[, , "tau"])
What we do see is an advantage of rank plots over trace plots as even with 100000 draws per chain, rank plots don’t get crowded and the mixing problems of chains is still easy to see.
With centered parameterization the mean estimate tends to get smaller with more draws. With 400000 draws using the centered parameterization the mean estimate is 3.77 (se 0.03). With 40000 draws using the non-centered parameterization the mean estimate is 3.6 (se 0.02). The difference is more than 8 sigmas. We are able to see the convergence problems in the centered parameterization case, if we do look carefully (or use divergence diagnostic ), but we do see that Rhat, Bulk-ESS, Tail-ESS and Monte Carlo error estimates for the mean can’t be trusted if other diagnostics indicate convergence problems!
In the following, we want to expand our understanding of the non-centered parameterization of the hierarchical model fit to the eight schools data.
writeLines(readLines("eight_schools_ncp.stan"))
data {
int<lower=0> J;
real y[J];
real<lower=0> sigma[J];
}
parameters {
real mu;
real<lower=0> tau;
real theta_tilde[J];
}
transformed parameters {
real theta[J];
for (j in 1:J)
theta[j] = mu + tau * theta_tilde[j];
}
model {
mu ~ normal(0, 5);
tau ~ cauchy(0, 5);
theta_tilde ~ normal(0, 1);
y ~ normal(theta, sigma);
}
In the main text, we have already seen that the non-centered parameterization works better than the centered parameterization, at least when we use an increased adapt_delta value. Let’s see what happens when using the default MCMC option of Stan.
fit_ncp <- stan(
file = 'eight_schools_ncp.stan', data = eight_schools,
iter = 2000, chains = 4, seed = 483892929, refresh = 0
)
We observe a few divergent transitions with the default of adapt_delta=0.8. Let’s analyze the sample.
res <- monitor_extra(fit_ncp)
print(res)
Inference for the input samples (4 chains: each with iter = 2000; warmup = 1000):
mean se_mean sd Q5 Q50 Q95 seff reff sseff zseff zsseff zsreff Rhat sRhat
mu 4.41 0.05 3.33 -1.13 4.49 9.74 4646 1.16 4709 4682 4746 1.19 1 1
tau 3.50 0.06 3.08 0.27 2.67 9.66 2857 0.71 2880 2529 2563 0.64 1 1
theta_tilde[1] 0.31 0.01 1.01 -1.38 0.33 1.96 5241 1.31 5257 5236 5252 1.31 1 1
theta_tilde[2] 0.09 0.01 0.94 -1.47 0.11 1.63 4830 1.21 4872 4883 4927 1.23 1 1
theta_tilde[3] -0.07 0.01 0.98 -1.67 -0.09 1.54 5687 1.42 5754 5674 5741 1.44 1 1
theta_tilde[4] 0.04 0.01 0.94 -1.49 0.03 1.59 4794 1.20 4845 4793 4846 1.21 1 1
theta_tilde[5] -0.16 0.01 0.92 -1.65 -0.17 1.38 4286 1.07 4357 4285 4355 1.09 1 1
theta_tilde[6] -0.09 0.01 0.94 -1.64 -0.11 1.45 4025 1.01 4058 4031 4063 1.02 1 1
theta_tilde[7] 0.33 0.01 0.96 -1.25 0.35 1.87 4600 1.15 4685 4596 4681 1.17 1 1
theta_tilde[8] 0.08 0.01 0.98 -1.50 0.08 1.63 5147 1.29 5194 5163 5204 1.30 1 1
theta[1] 6.12 0.08 5.53 -1.54 5.49 15.80 4262 1.07 4289 4532 4567 1.14 1 1
theta[2] 4.89 0.06 4.67 -2.69 4.83 12.60 5297 1.32 5305 5577 5586 1.40 1 1
theta[3] 4.00 0.08 5.15 -4.50 4.18 11.89 4063 1.02 4141 4375 4465 1.12 1 1
theta[4] 4.69 0.08 4.73 -2.82 4.61 12.21 3850 0.96 3973 4064 4209 1.05 1 1
theta[5] 3.70 0.07 4.57 -4.03 3.96 10.70 4371 1.09 4471 4554 4656 1.16 1 1
theta[6] 3.99 0.07 4.75 -4.12 4.17 11.45 4995 1.25 5000 5235 5241 1.31 1 1
theta[7] 6.26 0.08 5.04 -0.83 5.80 15.09 3947 0.99 3972 4029 4055 1.01 1 1
theta[8] 4.87 0.08 5.21 -3.15 4.81 13.49 4537 1.13 4536 4799 4817 1.20 1 1
lp__ -6.97 0.06 2.33 -11.33 -6.65 -3.77 1425 0.36 1462 1485 1525 0.38 1 1
zRhat zsRhat zfsRhat zfsseff zfsreff medsseff medsreff madsseff madsreff
mu 1 1 1.00 2017 0.50 4922 1.23 2143 0.54
tau 1 1 1.00 3010 0.75 3342 0.84 2936 0.73
theta_tilde[1] 1 1 1.00 1979 0.49 4884 1.22 2293 0.57
theta_tilde[2] 1 1 1.00 2048 0.51 4904 1.23 2124 0.53
theta_tilde[3] 1 1 1.00 1877 0.47 5215 1.30 2229 0.56
theta_tilde[4] 1 1 1.00 2167 0.54 4697 1.17 2463 0.62
theta_tilde[5] 1 1 1.01 2074 0.52 3959 0.99 2206 0.55
theta_tilde[6] 1 1 1.00 2001 0.50 4419 1.10 2532 0.63
theta_tilde[7] 1 1 1.00 1969 0.49 4776 1.19 2522 0.63
theta_tilde[8] 1 1 1.00 1813 0.45 4821 1.21 2193 0.55
theta[1] 1 1 1.00 2611 0.65 4468 1.12 2695 0.67
theta[2] 1 1 1.00 2318 0.58 5198 1.30 2641 0.66
theta[3] 1 1 1.00 2486 0.62 5213 1.30 2625 0.66
theta[4] 1 1 1.00 2273 0.57 4908 1.23 2558 0.64
theta[5] 1 1 1.00 2250 0.56 5000 1.25 2596 0.65
theta[6] 1 1 1.00 2243 0.56 5031 1.26 2414 0.60
theta[7] 1 1 1.00 2551 0.64 4630 1.16 2828 0.71
theta[8] 1 1 1.00 2224 0.56 4627 1.16 2587 0.65
lp__ 1 1 1.00 2456 0.61 2107 0.53 3014 0.75
All Rhats are close to 1, and ESSs are good despite a few divergent transitions. Small interval and quantile plots of tau reveal some sampling problems for small tau values, but not nearly as strong as for the centered parameterization.
plot_local_ess(fit = fit_ncp, par = "tau", nalpha = 20)
plot_quantile_ess(fit = fit_ncp, par = "tau", nalpha = 40)
Overall, the non-centered parameterization looks good even for the default settings of adapt_delta, and increasing it to 0.95 gets rid of the last remaining problems. This stands in sharp contrast to what we observed for the centered parameterization, where increasing adapt_delta didn’t help at all. Actually, this is something we observe quite often: A suboptimal parameterization can cause problems that are not simply solved by tuning the sampler. Instead, we have to adjust our model to achieve trustworthy inference.
We have already seen that the effective sample size of dynamic HMC can be higher than with independent draws. The next example illustrates interesting relative efficiency phenomena due to the properties of dynamic HMC algorithms.
We sample from a simple 16-dimensional standard normal model.
writeLines(readLines("normal.stan"))
data {
int<lower=1> J;
}
parameters {
vector[J] x;
}
model {
x ~ normal(0, 1);
}
fit_n <- stan(
file = 'normal.stan', data = data.frame(J = 16),
iter = 20000, chains = 4, seed = 483892929, refresh = 0
)
samp <- as.array(fit_n)
res <- monitor_extra(samp)
print(res)
Inference for the input samples (4 chains: each with iter = 10000; warmup = 0):
mean se_mean sd Q5 Q50 Q95 seff reff sseff zseff zsseff zsreff Rhat sRhat
x[1] 0.00 0.00 0.99 -1.62 0.00 1.63 96791 2.42 97003 96771 96981 2.42 1 1
x[2] 0.00 0.00 1.00 -1.64 0.01 1.66 104176 2.60 104353 104233 104409 2.61 1 1
x[3] 0.00 0.00 1.00 -1.66 0.00 1.64 92850 2.32 93370 92782 93302 2.33 1 1
x[4] 0.00 0.00 1.01 -1.65 0.00 1.66 101100 2.53 101428 101293 101622 2.54 1 1
x[5] 0.00 0.00 1.00 -1.64 0.00 1.64 101814 2.55 102625 101855 102670 2.57 1 1
x[6] 0.00 0.00 1.00 -1.65 0.01 1.64 92945 2.32 93713 92968 93734 2.34 1 1
x[7] -0.01 0.00 0.99 -1.65 -0.01 1.63 97394 2.43 97642 97353 97602 2.44 1 1
x[8] 0.00 0.00 1.00 -1.63 0.00 1.64 99741 2.49 99949 99688 99896 2.50 1 1
x[9] 0.00 0.00 1.00 -1.64 0.00 1.65 107571 2.69 108141 107561 108129 2.70 1 1
x[10] 0.00 0.00 0.99 -1.63 0.00 1.63 98812 2.47 99128 98838 99153 2.48 1 1
x[11] -0.01 0.00 1.00 -1.65 0.00 1.66 101379 2.53 101861 101299 101781 2.54 1 1
x[12] 0.00 0.00 1.00 -1.64 0.00 1.65 98780 2.47 99163 98819 99202 2.48 1 1
x[13] 0.01 0.00 1.00 -1.65 0.01 1.66 102158 2.55 102428 102104 102374 2.56 1 1
x[14] 0.00 0.00 1.00 -1.64 0.00 1.65 97052 2.43 97341 97223 97512 2.44 1 1
x[15] 0.00 0.00 1.01 -1.66 0.00 1.66 100055 2.50 100178 100000 100121 2.50 1 1
x[16] 0.00 0.00 1.00 -1.65 0.00 1.65 97495 2.44 97773 97510 97788 2.44 1 1
lp__ -7.99 0.02 2.85 -13.17 -7.67 -3.95 14445 0.36 14453 14188 14195 0.35 1 1
zRhat zsRhat zfsRhat zfsseff zfsreff medsseff medsreff madsseff madsreff
x[1] 1 1 1 16248 0.41 77884 1.95 19350 0.48
x[2] 1 1 1 15968 0.40 80243 2.01 19615 0.49
x[3] 1 1 1 16243 0.41 76082 1.90 18875 0.47
x[4] 1 1 1 16328 0.41 81090 2.03 19047 0.48
x[5] 1 1 1 16344 0.41 80858 2.02 19267 0.48
x[6] 1 1 1 16759 0.42 80715 2.02 19105 0.48
x[7] 1 1 1 16179 0.40 79227 1.98 19179 0.48
x[8] 1 1 1 16874 0.42 81168 2.03 19297 0.48
x[9] 1 1 1 16680 0.42 77215 1.93 19715 0.49
x[10] 1 1 1 16923 0.42 77581 1.94 19481 0.49
x[11] 1 1 1 16374 0.41 79409 1.99 18625 0.47
x[12] 1 1 1 16865 0.42 78582 1.96 19363 0.48
x[13] 1 1 1 15946 0.40 77579 1.94 19711 0.49
x[14] 1 1 1 17369 0.43 79292 1.98 20006 0.50
x[15] 1 1 1 16772 0.42 80536 2.01 19647 0.49
x[16] 1 1 1 16817 0.42 78882 1.97 19653 0.49
lp__ 1 1 1 21109 0.53 16373 0.41 23827 0.60
The Bulk-ESS for all \(x\) is larger than 9.330210^{4}. However tail-ESS for all \(x\) is less than 1.736910^{4}. Further, bulk-ESS for lp__ is only 1.419510^{4}.
If we take a look at all the Stan examples in this notebook, we see that the bulk-ESS for lp__ is always below 0.5. This is because lp__ correlates strongly with the total energy in HMC, which is sampled using a random walk proposal once per iteration. Thus, it’s likely that lp__ has some random walk behavior, as well, leading to autocorrelation and a small relative efficiency. At the same time, adaptive HMC can create antithetic Markov chains which have negative auto-correlations at odd lags. This results in a bulk-ESS greater than S for some parameters.
Let’s check the effective sample size in different parts of the posterior by computing the effective sample size for small interval estimates for x[1].
plot_local_ess(fit_n, par = 1, nalpha = 100)
The effective sample size for probability estimate for a small interval is close to 1 with a slight drop in the tails. This is a good result, but far from the effective sample size for the bulk, mean, and median estimates. Let’s check the effective sample size for quantiles.
plot_quantile_ess(fit = fit_n, par = 1, nalpha = 100)
Central quantile estimates have higher effective sample size than tail quantile estimates.
The total energy of HMC should affect how far in the tails a chain in one iteration can go. Fat tails of the target have high energy, and thus only chains with high total energy can reach there. This will suggest that the random walk in total energy would cause random walk in the variance of \(x\). Let’s check the second moment of \(x\).
samp_x2 <- as.array(fit_n, pars = "x")^2
res <- monitor_extra(samp_x2)
print(res)
Inference for the input samples (4 chains: each with iter = 10000; warmup = 0):
mean se_mean sd Q5 Q50 Q95 seff reff sseff zseff zsseff zsreff Rhat sRhat zRhat zsRhat
x[1] 0.99 0.01 1.41 0 0.45 3.81 15003 0.38 15032 16230 16251 0.41 1 1 1 1
x[2] 1.00 0.01 1.42 0 0.44 3.92 14175 0.35 14211 15949 15983 0.40 1 1 1 1
x[3] 1.00 0.01 1.42 0 0.45 3.89 14953 0.37 14953 16228 16236 0.41 1 1 1 1
x[4] 1.02 0.01 1.46 0 0.46 3.95 14416 0.36 14426 16327 16334 0.41 1 1 1 1
x[5] 0.99 0.01 1.41 0 0.44 3.80 14113 0.35 14122 16320 16334 0.41 1 1 1 1
x[6] 1.00 0.01 1.39 0 0.46 3.82 15467 0.39 15478 16725 16741 0.42 1 1 1 1
x[7] 0.99 0.01 1.39 0 0.44 3.82 15273 0.38 15291 16146 16160 0.40 1 1 1 1
x[8] 1.00 0.01 1.40 0 0.46 3.80 15859 0.40 15854 16891 16879 0.42 1 1 1 1
x[9] 1.00 0.01 1.42 0 0.45 3.84 14856 0.37 14874 16669 16685 0.42 1 1 1 1
x[10] 0.98 0.01 1.40 0 0.44 3.80 15182 0.38 15183 16894 16898 0.42 1 1 1 1
x[11] 1.01 0.01 1.40 0 0.47 3.86 14766 0.37 14764 16368 16353 0.41 1 1 1 1
x[12] 1.00 0.01 1.44 0 0.46 3.88 15039 0.38 15047 16857 16870 0.42 1 1 1 1
x[13] 1.01 0.01 1.41 0 0.46 3.84 14281 0.36 14284 16016 16048 0.40 1 1 1 1
x[14] 0.99 0.01 1.40 0 0.45 3.82 15836 0.40 15839 17373 17376 0.43 1 1 1 1
x[15] 1.01 0.01 1.44 0 0.45 3.90 15041 0.38 15042 16783 16784 0.42 1 1 1 1
x[16] 1.01 0.01 1.43 0 0.45 3.87 15063 0.38 15069 16778 16795 0.42 1 1 1 1
zfsRhat zfsseff zfsreff medsseff medsreff madsseff madsreff
x[1] 1 19185 0.48 19329 0.48 23759 0.59
x[2] 1 18862 0.47 19648 0.49 23880 0.60
x[3] 1 19526 0.49 18845 0.47 24924 0.62
x[4] 1 17976 0.45 19046 0.48 23027 0.58
x[5] 1 18613 0.47 19219 0.48 23699 0.59
x[6] 1 19530 0.49 19137 0.48 24473 0.61
x[7] 1 19258 0.48 19190 0.48 24176 0.60
x[8] 1 19610 0.49 19292 0.48 24890 0.62
x[9] 1 19020 0.48 19667 0.49 23809 0.60
x[10] 1 19825 0.50 19514 0.49 24381 0.61
x[11] 1 18992 0.47 18618 0.47 24285 0.61
x[12] 1 19392 0.48 19383 0.48 25018 0.63
x[13] 1 19233 0.48 19707 0.49 24289 0.61
x[14] 1 19916 0.50 20016 0.50 24604 0.62
x[15] 1 19255 0.48 19637 0.49 24952 0.62
x[16] 1 19579 0.49 19718 0.49 24354 0.61
The mean of the bulk-ESS for \(x_j^2\) is 1.65454410^{4}, which is quite close to the bulk-ESS for lp__. This is not that surprising as the potential energy in normal model is proportional to \(\sum_{j=1}^J x_j^2\).
Let’s check the effective sample size in different parts of the posterior by computing the effective sample size for small interval probability estimates for x[1]^2.
plot_local_ess(fit = samp_x2, par = 1, nalpha = 100)
The effective sample size is mostly a bit below 1, but for the right tail of \(x_1^2\) the effective sample size drops. This is likely due to only some iterations having high enough total energy to obtain draws from the high energy part of the tail. Let’s check the effective sample size for quantiles.
plot_quantile_ess(fit = samp_x2, par = 1, nalpha = 100)
We can see the correlation between lp__ and magnitude of x[1] in the following plot.
samp <- as.array(fit_n)
qplot(
as.vector(samp[, , "lp__"]),
abs(as.vector(samp[, , "x[1]"]))
) +
labs(x = 'lp__', y = 'x[1]')
Low lp__ values corresponds to high energy and more variation in x[1], and high lp__ corresponds to low energy and small variation in x[1]. Finally \(\sum_{j=1}^J x_j^2\) is perfectly correlated with lp__.
qplot(
as.vector(samp[, , "lp__"]),
as.vector(apply(samp[, , 1:16]^2, 1:2, sum))
) +
labs(x = 'lp__', y = 'sum(x^2)')
This shows that even if we get high effective sample size estimates for central quantities (like mean or median), it is important to look at the relative efficiency of scale and tail quantities, as well. The effective sample size of lp__ can also indicate problems of sampling in the tails.
makevars <- file.path(Sys.getenv("HOME"), ".R/Makevars")
if (file.exists(makevars)) {
writeLines(readLines(makevars))
}
CXXFLAGS=-O3 -mtune=native -march=native -Wno-unused-variable -Wno-unused-function
CXXFLAGS+=-flto -ffat-lto-objects -Wno-unused-local-typedefs
CXXFLAGS+=-std=c++11
CFLAGS+=-O3
devtools::session_info("rstan")
─ Session info ───────────────────────────────────────────────────────────────────────────────────
setting value
version R version 3.5.1 (2018-07-02)
os Ubuntu 16.04.5 LTS
system x86_64, linux-gnu
ui X11
language en_GB:en
collate en_US.UTF-8
ctype en_US.UTF-8
tz America/New_York
date 2018-12-08
─ Packages ───────────────────────────────────────────────────────────────────────────────────────
package * version date lib source
assertthat 0.2.0 2017-04-11 [1] CRAN (R 3.5.1)
backports 1.1.2 2017-12-13 [1] CRAN (R 3.5.1)
base64enc 0.1-3 2015-07-28 [1] CRAN (R 3.5.1)
BH 1.66.0-1 2018-02-13 [1] CRAN (R 3.5.1)
callr 3.0.0 2018-08-24 [1] CRAN (R 3.5.1)
cli 1.0.1 2018-09-25 [1] CRAN (R 3.5.1)
colorspace 1.3-2 2016-12-14 [1] CRAN (R 3.5.1)
crayon 1.3.4 2017-09-16 [1] CRAN (R 3.5.1)
desc 1.2.0 2018-05-01 [1] CRAN (R 3.5.1)
digest 0.6.18 2018-10-10 [1] CRAN (R 3.5.1)
fansi 0.4.0 2018-10-05 [1] CRAN (R 3.5.1)
ggplot2 * 3.1.0 2018-10-25 [1] CRAN (R 3.5.1)
glue 1.3.0 2018-07-17 [1] CRAN (R 3.5.1)
gridExtra * 2.3 2017-09-09 [1] CRAN (R 3.5.1)
gtable 0.2.0 2016-02-26 [1] CRAN (R 3.5.1)
inline 0.3.15 2018-05-18 [1] CRAN (R 3.5.1)
labeling 0.3 2014-08-23 [1] CRAN (R 3.5.1)
lattice 0.20-38 2018-11-04 [1] CRAN (R 3.5.1)
lazyeval 0.2.1 2017-10-29 [1] CRAN (R 3.5.1)
loo 2.0.0 2018-04-11 [1] CRAN (R 3.5.1)
magrittr 1.5 2014-11-22 [1] CRAN (R 3.5.1)
MASS 7.3-51.1 2018-11-01 [1] CRAN (R 3.5.1)
Matrix 1.2-15 2018-11-01 [1] CRAN (R 3.5.1)
matrixStats 0.54.0 2018-07-23 [1] CRAN (R 3.5.1)
mgcv 1.8-25 2018-10-26 [1] CRAN (R 3.5.1)
munsell 0.5.0 2018-06-12 [1] CRAN (R 3.5.1)
nlme 3.1-137 2018-04-07 [1] CRAN (R 3.5.1)
pillar 1.3.0 2018-07-14 [1] CRAN (R 3.5.1)
pkgbuild 1.0.2 2018-10-16 [1] CRAN (R 3.5.1)
plyr 1.8.4 2016-06-08 [1] CRAN (R 3.5.1)
prettyunits 1.0.2 2015-07-13 [1] CRAN (R 3.5.1)
processx 3.2.0 2018-08-16 [1] CRAN (R 3.5.1)
ps 1.2.1 2018-11-06 [1] CRAN (R 3.5.1)
R6 2.3.0 2018-10-04 [1] CRAN (R 3.5.1)
RColorBrewer 1.1-2 2014-12-07 [1] CRAN (R 3.5.1)
Rcpp 1.0.0 2018-11-07 [1] CRAN (R 3.5.1)
RcppEigen 0.3.3.4.0 2018-02-07 [1] CRAN (R 3.5.1)
reshape2 1.4.3 2017-12-11 [1] CRAN (R 3.5.1)
rlang 0.3.0.1 2018-10-25 [1] CRAN (R 3.5.1)
rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.5.1)
rstan * 2.18.2 2018-11-07 [1] CRAN (R 3.5.1)
scales 1.0.0 2018-08-09 [1] CRAN (R 3.5.1)
StanHeaders * 2.18.0 2018-10-07 [1] CRAN (R 3.5.1)
stringi 1.2.4 2018-07-20 [1] CRAN (R 3.5.1)
stringr * 1.3.1 2018-05-10 [1] CRAN (R 3.5.1)
tibble * 1.4.2 2018-01-22 [1] CRAN (R 3.5.1)
utf8 1.1.4 2018-05-24 [1] CRAN (R 3.5.1)
viridisLite 0.3.0 2018-02-01 [1] CRAN (R 3.5.1)
withr 2.1.2 2018-03-15 [1] CRAN (R 3.5.1)
[1] /u/77/ave/unix/R/x86_64-pc-linux-gnu-library/3.5
[2] /usr/local/lib/R/site-library
[3] /usr/lib/R/site-library
[4] /usr/lib/R/library